Regex in programming and Web development
Today we are going to write about regular expressions, known as regex or regexp for short, a very useful way of describing search patterns.
You have surely been in a situation where you needed to replace one word with another, or one character with another. Regular expressions can handle this and much more. Here you can read about how they work and how you can use regex to improve searching in programming and web development. Let’s get started.
So what is REGEX?
A regular expression is a sequence of characters that can be used to search through text or to validate it against defined conditions or rules.
That sequence forms a pattern, which is used to match character combinations in a string of text.
Its purpose is to perform simple or more complex matching of text. It allows you to search for specific characters, words, punctuation, etc. There are many uses for the search result. You can use regex for data validation, web scraping, or advanced find-and-replace operations, for example to change certain characters or to extract only the email addresses from a document, and much more.
Regular expressions are used in search engines, and many programming languages provide regex functionality (a regex engine), either built in or through libraries.
We will focus on regular expressions in JavaScript, where a regular expression is an object (an instance of RegExp).
How to use Regex?
As we mentioned, a regular expression can be a single character or a more complicated combination of characters forming a pattern.
Let's look at those characters that define the search.
Each character in a pattern is either a metacharacter, which has a special meaning, or a regular character, which has a literal meaning. Together, metacharacters and literal characters identify text. Pattern matches may vary from very precise to very general, as controlled by the metacharacters.
For example, in the regex 'm.', m is a literal character that matches only ‘m’, and ‘.’ is a metacharacter that matches any single character except a newline. Therefore, if we have the text “m0, me, mX”, our regex m. matches all three pairs.
The dot . is a very general pattern, [a-z] (matches any lowercase letter from 'a' to 'z') is less general, and a is a precise pattern (matches just 'a').
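To make the difference between general and precise patterns concrete, here is a minimal sketch that runs the three kinds of pattern against the sample text from above. The variable names are ours, and the g flag (explained later in this article) is only there so that match() returns every hit.

```javascript
// Sample text taken from the paragraph above.
const sampleText = "m0, me, mX";

// "m" is a literal; "." stands for any single character except a newline.
const veryGeneral = sampleText.match(/m./g);

// [a-z] narrows the second character down to lowercase letters.
const lessGeneral = sampleText.match(/m[a-z]/g);

// A pattern made only of literals matches exactly that text.
const precise = sampleText.match(/me/g);

console.log(veryGeneral); // [ 'm0', 'me', 'mX' ]
console.log(lessGeneral); // [ 'me' ]
console.log(precise);     // [ 'me' ]
```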
Regex Syntax and Metacharacters
A regular expression literal is composed of delimiters, a pattern and optional modifiers.
/pattern/modifiers;
Such as this example: /ma-no.org/i
is a regular expression, where ma-no.org is the pattern and i is a modifier (it makes the search case-insensitive).
In JavaScript, the delimiter of a regex literal is always the forward slash “/”, so a / inside the pattern has to be escaped as \/. Some other languages, such as PHP, accept any delimiter that is not a letter, number, backslash or space, so when you need to search for / there, you can use other delimiters like # or ~.
The pattern is what is being searched for, and the modifiers change how the search is performed, for example whether it is case-sensitive or whether it stops after the first match.
We can construct complex expressions by combining smaller ones, much like in arithmetic.
The idea is to make a small pattern of characters stand for a large number of possible strings, rather than compiling a large list of all the literal possibilities.
With this done, let's go to the next chapter.
Set of flags
What are regex flags? They are the modifiers placed after the closing delimiter. With them we can change how the expression is interpreted.
g
Global, performs a global match (continues after the first match through the whole string).
i
Makes the whole pattern case insensitive. For example, /AbC/i would match aBc, ABC, abc, etc.
m
Multiline, the anchors ^ and $ match at the beginning and end of each line, instead of only at the beginning and end of the whole string.
u
Unicode, treats the pattern as a sequence of Unicode code points and enables code point escapes such as \u{1F600}.
y
Sticky, the pattern matches only at its lastIndex position; in test() and exec() this takes precedence over the global (g) flag.
s
DotAll, the period or dot (.) matches any character, including a newline.
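The effect of each flag is easiest to see side by side. The following is a small sketch on a two-line sample string of our own choosing; the variable names are illustrative only.

```javascript
// Two lines of text: "Regex" on the first line, "regex" on the second.
const text = "Regex\nregex";

// i: case-insensitive. Without g, match() reports only the first hit.
const first = text.match(/regex/i)[0];          // "Regex"
// g: keep going after the first match.
const all = text.match(/regex/gi);              // [ 'Regex', 'regex' ]

// m: ^ matches at the start of every line, not only at the start of the string.
const lineStarts = text.match(/^r/gim).length;  // 2
const stringStart = text.match(/^r/gi).length;  // 1

// s: the dot also matches the newline character.
const dotWithoutS = /Regex.regex/.test(text);   // false
const dotWithS = /Regex.regex/s.test(text);     // true

// u: enables code point escapes such as \u{1F600}.
const unicodeHit = /\u{1F600}/u.test("\u{1F600}"); // true

// y: the match must start exactly at lastIndex.
const sticky = /regex/y;
sticky.lastIndex = 6;                           // "regex" starts at index 6
const stickyHit = sticky.test(text);            // true
```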
Regex for simple matching
Before we continue: not every regex works in every programming language, so check the documentation of the engine you use.
Here are the metacharacters and their definitions, with examples for JavaScript.
Period or Dot .
Wildcard, matches any single character except a newline.
For example /a.b/
matches “a3b” but also “acb”, etc.
Within [ ] the period or dot is literal.
Escape character \
Used when you want to match special characters literally, like ‘+’, ‘\’ or the period.
Example
/.\./
matches any character followed by a period: the first dot is a wildcard and the escaped second one is a literal period.
/\./
searches for a literal period instead of the wildcard.
/\(?a/
will find the character a, optionally preceded by the special character “(”.
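The escape examples above can be checked directly. This is a short sketch with sample strings we made up for the purpose:

```javascript
// /.\./ : any character followed by a literal period.
const anyThenDot = "a.b".match(/.\./)[0];        // "a."

// /\./ : only the literal period itself.
const literalDot = "a.b".match(/\./)[0];         // "."

// /\(?a/ : an "a" with an optional literal "(" in front of it.
const withParen = "f(a)".match(/\(?a/)[0];       // "(a"
const withoutParen = "banana".match(/\(?a/)[0];  // "a"

// Inside [ ] the dot is literal, so "abc" contains no match.
const dotInBrackets = /[.]/.test("abc");         // false
```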
Character classes match any one of a set of characters.
\w
matches any word character (letter, digit or underscore)
\W
matches any character that IS NOT a word character
\s
matches any whitespace character, such as space, tab or newline
\S
matches any character that IS NOT whitespace
\d
matches any digit
\D
matches any character that IS NOT a digit
\b
matches a word boundary, the zero-width position between a word character and a non-word character (such as a space, dash, comma or semicolon) or the start or end of the string
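Here is a brief sketch of the character classes in action. Note that they are written with a backslash. The sample string and variable names are ours.

```javascript
const sample = "Room 42, floor-3";

// \d picks out the individual digits.
const digits = sample.match(/\d/g);                   // [ '4', '2', '3' ]

// \s counts whitespace, \w word characters, \W everything else.
const whitespaceCount = sample.match(/\s/g).length;   // 2
const wordCharCount = sample.match(/\w/g).length;     // 12
const nonWordCount = sample.match(/\W/g).length;      // 4

// \b matches a position, not a character: mark every boundary with "|".
const boundaries = sample.replace(/\b/g, "|");        // "|Room| |42|, |floor|-|3|"

// A typical use: match "cat" only as a whole word.
const wholeWord = /\bcat\b/.test("concatenate");      // false
```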
Pipe character(vertical bar) |
Used like OR in programming, matches any one of the alternatives separated by it.
For example /m|mouse/
finds text that is either the letter “m” OR the letters “mouse”.
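One detail worth knowing is that JavaScript tries the alternatives from left to right and stops at the first one that fits. The sketch below, with our own sample strings, shows why the order of alternatives can matter:

```javascript
// The first alternative that matches wins, so "m" is found before "mouse" is tried.
const shortFirst = "mouse".match(/m|mouse/)[0];   // "m"

// Putting the longer alternative first finds the whole word.
const longFirst = "mouse".match(/mouse|m/)[0];    // "mouse"

// Alternation also works between whole words.
const pets = "cat, dog, bird".match(/cat|dog/g);  // [ 'cat', 'dog' ]
```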
Exclamation mark !
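Negates, but only as part of the negative lookahead (?! ) and negative lookbehind (?<! ) syntax; on its own, ! is a literal character.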
Caret symbol ^
Beginning of the line or text string in which we're searching
Dollar symbol $
End of the line or text string in which we're searching
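A quick sketch of the two anchors, using a sample sentence of our own:

```javascript
const greeting = "Hello world";

// ^ ties the pattern to the beginning of the string.
const startsWithHello = /^Hello/.test(greeting);  // true
const startsWithWorld = /^world/.test(greeting);  // false

// $ ties the pattern to the end of the string.
const endsWithWorld = /world$/.test(greeting);    // true

// Used together, they require the whole string to match.
const exact = /^abc$/.test("abc");                // true
const notExact = /^abc$/.test("xabcx");           // false
```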
Quantifiers
These symbols act as repeaters: they specify how many times the preceding character or group may occur.
Question mark ?
Makes the character before the question mark optional, like ‘-?’, where the dash is optional
For example /ab?c/
will match ‘ac’ but also ‘abc’
Asterisk *
Zero or more occurrences of the preceding character.
Examples :
/a.*b/
matches any string that contains ‘a’ followed later by ‘b’, because the wildcard period may occur zero or more times between them.
/ab*c/
matches ‘ac’, ‘abc’, ‘abbbc’, etc.
/[xyz]*/
matches ‘’, ‘x’, ‘y’, ‘z’, ‘zx’, ‘zyx’, ‘xyzzy’, and so on.
/(ab)*/
matches ‘’, ‘ab’, ‘abab’, ‘ababab’, and so on.
Plus symbol +
Indicates one or more occurrences of the preceding character
Example /[a]+/g
will match both ‘a’ in the word ‘Palma’.
Curly braces { }
Delimit a minimum and maximum number of repetitions for the character or group right before the braces, like /o{2,3}/
which finds the two o's in “school”, or /(c|r|a){2,3}/
which finds “ra” in “rat” and “ca” in “cat”.
{n}
→ preceding character must occur exactly n times, example /a{3}/ which matches “aaa”.
{min,}
→ preceding character may occur min or more times, example /a{3,}/ which matches “aaa”, “aaaa”, and so on.
{min,max}
→ preceding character may occur at least min times, but not more than max times.
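The quantifiers can be compared with a few short checks. In this sketch we add ^ and $ where we want the whole string to match; the sample strings are ours.

```javascript
// ? : the "b" is optional, so "ac" and "abc" pass but "abbc" does not.
const optional = ["ac", "abc", "abbc"].filter((s) => /^ab?c$/.test(s));

// * : zero or more "b" characters.
const zeroOrMore = ["ac", "abc", "abbbc"].filter((s) => /^ab*c$/.test(s));

// + : one or more; with g, both "a" characters in "Palma" are found.
const oneOrMore = "Palma".match(/a+/g);           // [ 'a', 'a' ]

// {min,max}, {min,} and {n}
const twoToThree = "school".match(/o{2,3}/)[0];   // "oo"
const atLeastThree = "aaaa".match(/a{3,}/)[0];    // "aaaa"
const exactlyThree = "aaaa".match(/a{3}/)[0];     // "aaa"

console.log(optional);   // [ 'ac', 'abc' ]
console.log(zeroOrMore); // [ 'ac', 'abc', 'abbbc' ]
```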
Character grouping
Brackets [ ]
Inside the brackets we put the characters we want to match. A hyphen between two characters specifies a range, and single characters and ranges can be mixed, like [abcX-Z].
/[cmt]at/g
will find “cat” and “mat” but nothing in “that”.
[a-z]
a range of characters, in this case the lowercase letters.
[^ ]
matches a single character that is not contained within the brackets, for example /[^a-z]/
matches any single character that is not a lowercase letter from a to z.
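A small sketch of bracket expressions, with sample strings chosen by us:

```javascript
// One of c, m or t followed by "at".
const rhymes = "cat mat that".match(/[cmt]at/g);            // [ 'cat', 'mat' ]

// Single characters and a range mixed in one set.
const mixed = "abcdXYZ".match(/[abcX-Z]/g).join("");        // "abcXYZ"

// A leading ^ inside the brackets negates the set.
const notLowercase = "aB3-c".match(/[^a-z]/g);              // [ 'B', '3', '-' ]
```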
Pattern group ()
The text matched inside the parentheses is captured and can be recalled later. A marked subexpression is also called a block or capturing group.
Example :
/(p|P)/
searches for lowercase p OR uppercase P.
(?<name>...)
- names the group so that it can be referred to later by name.
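What "recalled later" means in practice is sketched below: a group can be reused in the replacement text, inside the pattern itself, or read from the match result. The sample strings and the group names year, month and day are our own.

```javascript
// Numbered groups can be reused in the replacement string as $1, $2, ...
const swapped = "John Smith".replace(/(\w+) (\w+)/, "$2 $1"); // "Smith John"

// Inside the pattern, \1 refers back to whatever group 1 matched.
const repeatedWord = /(\w+) \1/.test("hello hello");          // true

// Named groups are available on the groups property of the match.
const dateMatch = "2024-05-17".match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const year = dateMatch.groups.year;                           // "2024"
const month = dateMatch[2];                                   // "05", still reachable by number

// The example from above: lowercase p OR uppercase P.
const letters = "Pepper".match(/(p|P)/g);                     // [ 'P', 'p', 'p' ]
```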
Look aheads and look behinds
With these patterns we can match something only if it is (or is not) preceded or followed by something else, without including that context in the match. Don't be put off by the names.
/(?<=)./
Positive lookbehind. Matches any character that is immediately preceded by the characters defined in the lookbehind. Example /(?<=[Tt]he)./g
will find any character that comes right after The or the. /(?<=[h])a/g
will find any a that comes right after the letter h, for example the a in the word That.
/(?<!)./
Negative lookbehind. Matches any character that is not immediately preceded by the characters defined in the lookbehind.
/.(?=)/
Positive lookahead. Matches any character that is immediately followed by the characters defined in the lookahead. /.(?=[tT])/g
will find all the characters that come right before a T or t.
/.(?!)/
Negative lookahead. Matches any character that is not immediately followed by the characters defined in the lookahead. /.(?![ ])/g
will find all the characters that are not directly before a space.
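The four lookarounds can be tried out as follows. This is a sketch on short sample strings of our own; note that lookbehind needs a reasonably recent JavaScript engine.

```javascript
// Positive lookbehind: the character right after "The" or "the".
const afterThe = "The theme".match(/(?<=[Tt]he)./g);   // [ ' ', 'm' ]

// Positive lookbehind: an "a" right after an "h".
const aAfterH = "That".match(/(?<=[h])a/g);            // [ 'a' ]

// Negative lookbehind: a digit that is not right after an "a".
const digitNotAfterA = "a1 b2".match(/(?<!a)\d/g);     // [ '2' ]

// Positive lookahead: the character right before a "t" or "T".
const beforeT = "cat Tom".match(/.(?=[tT])/g);         // [ 'a', ' ' ]

// Negative lookahead: characters that are not right before a space.
const notBeforeSpace = "a b".match(/.(?![ ])/g);       // [ ' ', 'b' ]
```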
Let's go deeper into Regex
Now we can use what we have learned to check, for example, for a phone number in a text.
Let us show you some advanced examples.
^[ \t]+|[ \t]+$
Matches excess whitespace (spaces and tabs) at the beginning or end of a line
[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?
Matches any numeral
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Matches email address
^((\+\d{1,3}(-| )?\(?\d\)?(-| )?\d{1,5})|(\(?\d{2,6}\)?))(-| )?(\d{3,4})(-| )?(\d{4})(( x| ext)\d{1,5}){0,1}$
Matches a phone number with an optional country prefix and extension, for example a 9-digit EU number with its country prefix written as 0034123456789.
^((\+1)[ -])?\(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4})$
Matches a US phone number with international prefix +1.
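As a rough sketch of how such patterns might be used for validation and cleanup in JavaScript, the snippet below applies four of them to sample inputs we made up. The whitespace pattern is written with a space and a tab inside the brackets. These patterns are illustrations and are not meant to cover every valid phone number or email address.

```javascript
const usPhone = /^((\+1)[ -])?\(?(\d{3})\)?[ -]?(\d{3})[ -]?(\d{4})$/;
const simpleEmail = /^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/;
const edgeWhitespace = /^[ \t]+|[ \t]+$/g;
const numeral = /[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?/;

// Validation: test() answers yes or no.
const phoneWithPrefix = usPhone.test("+1 (555) 123-4567");   // true
const phonePlain = usPhone.test("555-123-4567");             // true
const phoneTooShort = usPhone.test("12345");                 // false
const emailOk = simpleEmail.test("user@example.com");        // true
const emailNoDomain = simpleEmail.test("user@example");      // false

// Cleanup: remove spaces and tabs at both ends.
const trimmed = "  hello \t".replace(edgeWhitespace, "");    // "hello"

// Extraction: pull the first number out of a longer text.
const found = "x = -3.14e10;".match(numeral)[0];             // "-3.14e10"
```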
Regex and JavaScript
Two string methods are commonly used with regular expressions.
search()
uses an expression to search for a match and returns the position of the match (note the case-insensitive modifier).
var string = "Ma-no.org!"; var result = string.search(/NO/i); //result = 3
replace()
returns a modified string where the pattern is replaced (note the positive lookbehind).
var string2 = "Visit Palma!"; var result2 = string2.replace(/(?<=[ ])[a-zA-Z]{1,5}/g, "Pizzeria"); //result2 = "Visit Pizzeria!"
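A few more behaviors of search() and replace() are worth seeing, sketched here with our own sample strings: what search() returns when nothing matches, the difference the g flag makes, and a function used as the replacement.

```javascript
// search() returns -1 when there is no match.
const notFound = "Ma-no.org!".search(/xyz/);          // -1

// Without g only the first match is replaced; with g, all of them.
const firstOnly = "a-b-c".replace(/-/, "+");          // "a+b-c"
const everyMatch = "a-b-c".replace(/-/g, "+");        // "a+b+c"

// The replacement can also be a function that receives each match.
const capitalized = "visit palma".replace(/\b[a-z]/g, (letter) => letter.toUpperCase());
console.log(capitalized);                             // "Visit Palma"
```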
There are two ways to construct a regex, as follows:
let myRegex = /ab+c/; let myRegexObject = new RegExp('ab+c');
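The constructor form is mostly useful when the pattern is only known at run time. The sketch below shows two things to watch for: backslashes must be doubled inside a string, and text that should be matched literally has to be escaped first. The escapeRegExp helper is our own hypothetical function, not something built into the language at the time of writing.

```javascript
// Hypothetical helper: put a backslash in front of every regex metacharacter.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Both forms produce an equivalent pattern.
const literal = /ab+c/;
const constructed = new RegExp("ab+c");
const sameSource = literal.source === constructed.source;        // true

// In a string, \d has to be written as \\d; flags go in the second argument.
const digitRuns = "a1b22".match(new RegExp("\\d+", "g"));        // [ '1', '22' ]

// Unescaped, the dot in the input acts as a wildcard.
const userInput = "ma-no.org";
const loose = new RegExp(userInput).test("ma-noXorg");           // true
const strict = new RegExp(escapeRegExp(userInput)).test("ma-noXorg"); // false
```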
In JavaScript, regular expressions are objects and are often used with the two predefined object methods.
test()
searches a string for a pattern and returns true or false
<p id="demo">Ma-no.org</p>
<script>
var inputOutput = document.getElementById("demo").innerHTML;
document.getElementById("demo").innerHTML = /o/.test(inputOutput); //true
</script>
exec()
searches a string for a specified pattern and returns the found text as an object; if nothing is found, it returns null.
var obj = /M/.exec(inputOutput);
document.getElementById("demo").innerHTML = "Found " + obj[0] + " in position " + obj.index + " in the text: " + obj.input; //Found M in position 0 in the text: Ma-no.org
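With the g flag, exec() remembers where it stopped, so calling it in a loop walks through all the matches. Here is a minimal sketch without any HTML; the variable names are ours.

```javascript
const source = "Ma-no.org";
const everyO = /o/g;
const positions = [];

// Each call continues from lastIndex; exec() returns null when nothing is left.
let match;
while ((match = everyO.exec(source)) !== null) {
  positions.push(match.index);
}
console.log(positions);               // [ 4, 6 ]

// A pattern that does not occur gives null right away.
const noMatch = /z/.exec(source);     // null
```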
match()
Returns an array containing all of the matches (with the g flag) or the first match together with its capturing groups (without it), or null if no match is found.
const paragraph = 'Ma-No.org';
const regex = /[A-Z]/g;
document.getElementById("demo").innerHTML = paragraph.match(regex); //M,N
matchAll()
Returns an iterator of all the matches, including capturing groups. The regular expression must have the g flag.
split()
Uses a regular expression or a fixed string to break a string into an array of substrings.
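Since matchAll() and split() have no example above, here is a short sketch of both, using the same sample string as before plus one of our own:

```javascript
const paragraph = "Ma-No.org";

// matchAll() needs the g flag; each entry carries the match and its index.
const capitals = [...paragraph.matchAll(/[A-Z]/g)].map((m) => `${m[0]}@${m.index}`);
console.log(capitals);                              // [ 'M@0', 'N@3' ]

// split() on either a dash or a period.
const parts = paragraph.split(/[-.]/);              // [ 'Ma', 'No', 'org' ]

// split() on a comma or semicolon with optional whitespace around it.
const fields = "a, b;c".split(/\s*[,;]\s*/);        // [ 'a', 'b', 'c' ]
```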
There are also quite useful properties, like source, which returns the text of the pattern.
var pattern = /Ma-no.org/g; document.getElementById("demo").innerHTML = pattern.source; //Ma-no.org
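Besides source, a RegExp object exposes its flags and, for global or sticky patterns, the position where the next search will start. The sketch below uses a pattern of our own and also shows a common surprise: reusing a g-flagged regex with test() gives alternating results because lastIndex is remembered between calls.

```javascript
const pattern = /no/gi;

const source = pattern.source;                    // "no"
const flags = pattern.flags;                      // "gi"
const isGlobal = pattern.global;                  // true

// The first call finds "no" at index 3 and remembers where it ended.
const firstTest = pattern.test("Ma-no.org");      // true
const indexAfterFirst = pattern.lastIndex;        // 5

// The second call starts at index 5, finds nothing and resets lastIndex.
const secondTest = pattern.test("Ma-no.org");     // false
const indexAfterSecond = pattern.lastIndex;       // 0
```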
Conclusion
Regexes are useful in a wide variety of text-processing tasks and, more generally, in string processing.
Finally, there are web apps and online playgrounds that are great for practicing and testing.
We can recommend these sites:
https://regex101.com/
https://regexr.com/
Javascript Regex cheatsheet
And the last one we recommend for checking browser compatibility.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions.
Image by CyberHades.