I am working with regex in both PHP and JavaScript, so I was looking for a good tutorial. In this regex tutorial I found an example of a lookbehind that matches a 3-digit number only if it is preceded by the word "USD". It shows two cases: one with the lookbehind placed after the part that matches, and one with it placed before.
Here are the regex patterns:
\d{3}(?<=USD\d{3}) //after the match
(?<=USD)\d{3} //before the match
The example string is:
USD100;
I grasped the idea, but I could not figure out what is actually going on inside the regex engine to complete the task. Can anyone explain it in simple terms? Thanks in advance.
The example below shows how PCRE (and most engines) implement look-behind. Take note of the position of the cursor just before entering the look-behind in each case.
In the case of \d{3}(?<=USD\d{3}), note that the cursor advances 3 positions after matching \d{3}, so the look-behind needs to look back past the 3 digits that were just consumed in order to check for USD in front of them.
This method makes sure that the numbers are there first, before checking for the prefix.
USD100;
^
Attempting to match \d{3}. Fail and bump along.
USD100;
 ^
Attempting to match \d{3}. Fail and bump along.
USD100;
  ^
Attempting to match \d{3}. Fail and bump along.
USD100;
   ^
Attempting to match \d{3}.
USD100;
      ^
Matched \d{3}. Attempting to assert (?<=USD\d{3}) (length 6).
USD100;
^     +
Save current position. Go back 6 characters.
(Attempt to match USD\d{3} succeeds, positive look-behind succeeds)
USD100;
      ^
Back to the saved position and report a match.
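To check the outcome of this trace, here is a minimal sketch in JavaScript. It assumes an engine with look-behind support (ES2018 or later, e.g. a recent Node.js); the variable names and the EUR100; counter-example are mine, not from the tutorial. It only shows the final result, not the engine's internal steps.

```javascript
// Pattern 1: consume the digits first, then look behind past them.
const digitsThenCheck = /\d{3}(?<=USD\d{3})/;

// The match is the 3 digits only; USD is checked but not consumed.
const hit = digitsThenCheck.exec("USD100;");
console.log(hit[0], hit.index); // 100 3

// Without the USD prefix the look-behind fails, so there is no match.
const miss = digitsThenCheck.exec("EUR100;");
console.log(miss); // null
```

The reported index is 3, which is where \d{3} started matching in the trace above.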
In the case of (?<=USD)\d{3}, note that the cursor is right in front of 100, so it only needs to look back 3 characters to check that USD is there.
This method makes sure that the prefix exists first, before matching the numbers.
USD100;
^
Attempting to assert (?<=USD) (length 3). Fail length check and bump along.
USD100;
 ^
Attempting to assert (?<=USD) (length 3). Fail length check and bump along.
USD100;
  ^
Attempting to assert (?<=USD) (length 3). Fail length check and bump along.
USD100;
   ^
Attempting to assert (?<=USD) (length 3).
USD100;
^  +
Save current position. Go back 3 characters.
(Attempt to match USD succeeds, positive look-behind succeeds)
USD100;
   ^
Back to the saved position. Attempting to match \d{3}.
USD100;
      ^
Matched \d{3} and report a match.
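Both patterns should end up reporting the same matches, even though they get there in a different order. The following sketch compares them in JavaScript, again assuming an ES2018+ engine; the sample string with several amounts is made up for illustration.

```javascript
// Pattern 2: check the prefix first, then consume the digits.
const prefixThenDigits = /(?<=USD)\d{3}/g;
// Pattern 1 from earlier, for comparison.
const digitsThenPrefix = /\d{3}(?<=USD\d{3})/g;

const sample = "USD100; EUR200; USD300;";

// Collect [matched text, start index] pairs for a global regex.
const collect = (re) => [...sample.matchAll(re)].map((m) => [m[0], m.index]);

const viaPrefixFirst = collect(prefixThenDigits);
const viaDigitsFirst = collect(digitsThenPrefix);

console.log(viaPrefixFirst); // [ [ '100', 3 ], [ '300', 19 ] ]
console.log(JSON.stringify(viaPrefixFirst) === JSON.stringify(viaDigitsFirst)); // true
```

The difference between the two is only in how much work the engine does before it can reject a position, not in what gets matched.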
Look-behind has no single standard definition, so different engines implement it differently and place different limits on what is allowed inside it.
.NET implements look-behind by matching the pattern inside look-behind from right-to-left. This makes it possible to put any construct inside look-behind, but since the tokens in the pattern are read from right-to-left, it is confusing when the pattern contains backreferences.
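I can't run .NET here, but ECMAScript look-behind is also specified to match backwards, so JavaScript (ES2018+) can be used as a stand-in to illustrate why right-to-left matching is surprising once capture groups are involved. Treat this as an illustration of the general idea, not as a statement about .NET's exact behavior; the patterns are made up for the demo.

```javascript
// Inside a look-behind, the engine works from right to left,
// so the RIGHTMOST greedy group gets first pick of the characters.
const backward = /(?<=([ab]+)([bc]+))$/.exec("abc");
console.log(backward[1], backward[2]); // a bc

// The same sub-pattern in a look-ahead is matched left to right,
// so the LEFTMOST greedy group gets first pick instead.
const forward = /^(?=([ab]+)([bc]+))/.exec("abc");
console.log(forward[1], forward[2]); // ab c
```

The groups are still numbered left to right, but they are filled in the opposite order, which is the source of the confusion with backreferences.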
Other engines (PCRE included) choose to match the pattern inside the look-behind from left to right: they study the pattern to determine its length, then perform the match starting from the current position minus that length. Since not all patterns have a bounded length, most such engines reject those patterns to keep performance reasonable.
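The "step back by the pattern length, then match left to right" strategy can be sketched by hand. This is a rough emulation in JavaScript, not PCRE's actual code: assertBehind and findAmount are hypothetical helpers, and the sticky (y) flag is used only to force a match to start at one exact position.

```javascript
// Hypothetical helper: emulate a fixed-length positive look-behind.
// Returns true if `inner` matches the `length` characters ending at `pos`.
function assertBehind(text, pos, inner, length) {
  const start = pos - length;
  if (start < 0) return false; // "fail length check": not enough text behind
  const re = new RegExp(inner.source, "y"); // sticky: must match at lastIndex
  re.lastIndex = start;
  const m = re.exec(text);
  return m !== null && m[0].length === length; // must end exactly at pos
}

// Hypothetical helper: emulate (?<=USD)\d{3} by bumping along the string.
function findAmount(text) {
  for (let pos = 0; pos <= text.length; pos++) {
    if (!assertBehind(text, pos, /USD/, 3)) continue; // bump along
    const digits = /\d{3}/y;
    digits.lastIndex = pos; // back at the saved position
    const m = digits.exec(text);
    if (m) return { match: m[0], index: pos };
  }
  return null;
}

console.log(findAmount("USD100;")); // { match: '100', index: 3 }
console.log(findAmount("EUR100;")); // null
```

Because the helper needs a single `length` up front, it also shows why a pattern without a fixed or bounded length does not fit this approach.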