Regex can be used to select everything between two specified characters. This is useful for extracting the contents of parentheses, as in (abc), or for extracting folder names from a file path (e.g. C:/documents/work/).
A regular expression that matches all characters between two specified characters makes use of look-ahead (?=…) and look-behind (?<=…) statements to isolate the string, and then uses the dot character . to select all contents between the delimiters.
An expression that matches everything between a and b is:
/(?<=a).*(?=b)/g
Let’s discuss how it works:
How it Works
The expression starts with a positive look-behind (?<=…), which ensures that the matched string is preceded by whatever is in the place of …. In this case, we want to ensure that the letter a directly precedes the matched string.
/(?<=a)/
Look-aheads and look-behinds are assertions, which means that they only check whether a certain condition is true. Their contents (a in this case) are not included in the match.
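To make the "checked but not matched" behaviour concrete, here is a minimal sketch using Python's re module (an assumption; the expressions on this page are written as JavaScript-style literals). The sample text and variable names are illustrative only.

```python
import re

text = "xaHELLOb"
match = re.search(r"(?<=a).*(?=b)", text)

# The delimiters are checked by the lookarounds but not consumed,
# so neither "a" nor "b" appears in the matched text.
inner = match.group()   # "HELLO"
span = match.span()     # (2, 7): starts right after "a", ends right before "b"

print(inner, span)
```

The span shows that the match begins one position after the a and ends one position before the b.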
After the a character, we want to match any character. This is denoted by the dot symbol ., which matches any character except a newline character. On its own, the dot symbol matches only a single character, so we need to include the zero-or-more quantifier * after it to ensure that we match zero or more of any character.
/(?<=a).*/
We want to stop matching when we encounter a b character. This is specified by a positive look-ahead (?=…), which ensures that the matched string is directly followed by whatever is in the place of ….
In this case, we use the character b inside the positive look-ahead:
/(?<=a).*(?=b)/
Finally, to return every instance of this match and not just the first, we include the global modifier g at the very end of the expression:
/(?<=a).*(?=b)/g
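The finished expression can be tried out with a short sketch in Python (an assumed choice of language). Python's re module has no g flag; as an approximation, re.search plays the role of the expression without g, and re.findall the role of the expression with it. The two-line sample text is made up for illustration.

```python
import re

pattern = r"(?<=a).*(?=b)"
text = "a123b\na456b"   # two lines, each with its own a...b pair

# Without a global search, only the first match is returned.
first_only = re.search(pattern, text).group()   # "123"

# findall returns every match, similar to the g modifier.
all_matches = re.findall(pattern, text)         # ["123", "456"]

print(first_only)
print(all_matches)
```

Because the dot does not match newlines, each line produces its own match here.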
Match All Characters Greedy vs. Lazy
The following expression will match as many characters between a and b as it can. This is because the zero-or-more quantifier * is greedy.
/(?<=a).*(?=b)/g
Applied to the text another baby bathtub, this will produce a single match:
nother baby bathtu
Notice how it skips over three b characters and only stops the match right at the last b.
However, if we add a lazy indicator ? after the zero-or-more quantifier, it makes the quantifier lazy, causing it to match as few characters as possible.
/(?<=a).*?(?=b)/g
Applied to the text another baby bathtub, the first match now stops right before the first b:
"nother " (including the trailing space)
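The difference between the greedy and lazy quantifiers can be checked with a small Python sketch (Python is an assumption here, as are the variable names). It uses re.search, so only the first match of each expression is compared.

```python
import re

text = "another baby bathtub"

# Greedy: .* runs as far as it can and stops before the LAST b.
greedy = re.search(r"(?<=a).*(?=b)", text).group()

# Lazy: .*? stops as soon as the look-ahead succeeds, before the FIRST b.
lazy = re.search(r"(?<=a).*?(?=b)", text).group()

print(repr(greedy))   # 'nother baby bathtu'
print(repr(lazy))     # 'nother '
```

How later matches are reported when a lazy expression produces an empty match (for example between the a and b in baby) can differ between regex engines, so the sketch compares first matches only.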
Regex Match All Including Newline Characters
The expression above will match all characters between the two specified characters, except the newline character. To include the newline character in the match, we have several options.
This can be done by including the dotall modifier s (also called the single-line modifier) at the end, which treats the entire input text as a single line and therefore also matches newline characters.
/(?<=a).*(?=b)/gs
Some flavours of regex allow turning on the dotall modifier inside the expression using (?s):
/(?s)(?<=a).*(?=b)/g
If the dotall modifier is not available in your flavour of regex, you can replace the dot symbol . with the character class [\s\S]. This matches all whitespace characters \s (which include spaces, tabs, newlines, etc.) and all non-whitespace characters \S (which include letters, numbers, punctuation, etc.).
/(?<=a)[\s\S]*(?=b)/g
The square brackets define a character class, which matches any single character from the set it contains. Since every character is either whitespace or non-whitespace, [\s\S] matches any character, including newlines. The zero-or-more quantifier * works just as before.
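The three options for matching across newlines can be compared side by side. The sketch below assumes Python, where the dotall modifier is spelled re.DOTALL (or re.S) rather than a trailing s; the sample text is illustrative.

```python
import re

text = "a1\n2b"   # the a and b are on different lines

# Plain dot: cannot cross the newline, so nothing matches.
plain = re.findall(r"(?<=a).*(?=b)", text)

# Option 1: dotall flag passed to the function.
with_flag = re.findall(r"(?<=a).*(?=b)", text, flags=re.DOTALL)

# Option 2: dotall switched on inside the expression (must come first).
inline = re.findall(r"(?s)(?<=a).*(?=b)", text)

# Option 3: character class that matches any character at all.
char_class = re.findall(r"(?<=a)[\s\S]*(?=b)", text)

print(plain)        # []
print(with_flag)    # ['1\n2']
print(inline)       # ['1\n2']
print(char_class)   # ['1\n2']
```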
Match All Between Two Characters Without Lookarounds
Some flavours of regex do not support look-aheads and look-behinds at all. In these cases, we can use the following expression.
/a(.*)b/g
Here we use the dot symbol . together with the zero-or-more quantifier * to match zero or more of any character. These are enclosed in parentheses () to capture the contents for later use.
Finally, this entire expression is sandwiched between the two characters we want to have matched, a and b in this case.
Note that this expression will return a and b together with the contents between them. However, the contents without a and b will be contained in the first capture group returned.
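The difference between the full match and the first capture group is easy to see in a short sketch. Python is assumed here, and the sample text is made up.

```python
import re

match = re.search(r"a(.*)b", "xaHELLOb")

# Group 0 is the whole match, delimiters included.
whole = match.group(0)      # "aHELLOb"

# Group 1 is only what the parentheses captured.
contents = match.group(1)   # "HELLO"

print(whole, contents)
```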
All of the modifications above can be used on this expression. For example, newline characters can be included with:
/a([\s\S]*)b/g
Or the zero-or-more quantifier can be made lazy using the lazy indicator ?:
/a(.*?)b/g
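These variations can be sketched in Python (an assumption; in Python, re.findall returns the contents of the capture group when the expression has exactly one). The last part of the sketch is an added observation rather than something from the original solutions: when both delimiters are the same character, the capture-group version consumes each delimiter, while the lookaround version can reuse it.

```python
import re

# Greedy vs. lazy capture on the same text.
greedy_groups = re.findall(r"a(.*)b", "a1b a2b")    # ['1b a2']
lazy_groups = re.findall(r"a(.*?)b", "a1b a2b")     # ['1', '2']

# Newlines are included once the dot is replaced by [\s\S].
multiline_groups = re.findall(r"a([\s\S]*)b", "a1\n2b")

# Same character as opening and closing delimiter (folder names in a path).
path = "C:/documents/work/"
consuming = re.findall(r"/(.*?)/", path)            # slash after "documents" is used up
lookaround = re.findall(r"(?<=/).*?(?=/)", path)    # slash can be shared

print(greedy_groups, lazy_groups, multiline_groups)
print(consuming)    # ['documents']
print(lookaround)   # ['documents', 'work']
```

If adjacent items share a delimiter, as folder names in a path do, the lookaround form may be the better fit where it is supported.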
Which Flags to Use
To extract all matches from the piece of text, and not just the first match, be sure to include the global modifier g at the end of the expression:
/(?<=a).*(?=b)/g
Since we are working with text here, you can also include the case-insensitive modifier i to include matches regardless of their case.
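As a rough illustration of the case-insensitive modifier, the sketch below assumes Python, where the i modifier is spelled re.IGNORECASE and re.findall stands in for g. The sample text is illustrative.

```python
import re

pattern = r"(?<=a).*(?=b)"
text = "A123B\na456b"   # first line uses upper-case delimiters

# Case-sensitive: only the lower-case a...b pair is found.
case_sensitive = re.findall(pattern, text)

# Case-insensitive: A...B is accepted as well.
case_insensitive = re.findall(pattern, text, flags=re.IGNORECASE)

print(case_sensitive)     # ['456']
print(case_insensitive)   # ['123', '456']
```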
Sources
The regular expressions on this page were adapted from solutions presented on Stack Overflow by Gopi posted on this question, by stema posted on this question, and by cletus posted on this question.