java regex regex-lookarounds lookbehind

positive lookbehind not behaving correctly


The code snippet for the positive lookbehind is below:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositiveLookBehind {
    public static void main(String[] args) {
        String regex = "[a-z](?<=9)";
        String input = "a9es m9x us9s w9es";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);

        System.out.println("===starting====");
        while(matcher.find()) {
            System.out.println("found:"+matcher.group()
            +" start index:"+matcher.start()
            +" end index is "+matcher.end()); 
        }
        System.out.println("===ending=====");
    }
}

I was expecting 4 matches, but to my surprise the output shows no match.

Can anyone point out my mistake?

As far as I understand, the regex here means a letter preceded by the digit 9, which is satisfied at 4 locations.


Solution

  • Problem

    Notice that (?<=9) is placed after [a-z]. What does that mean?

    Let's consider data like "a9c".

    At the start, the regex engine places its "cursor" at the beginning of the string it iterates over, here:

    |a9c
    ^-regex cursor is here
    

    Then the regex engine tries to match each part of the pattern from left to right. So in the case of [a-z](?<=9) it will first try to find a match for [a-z], and only after that succeeds will it move on to evaluating the (?<=9) part.

    So the match for [a-z] will happen here:

    a9c
    *<-- match for `[a-z]`
    

    After that match the regex engine will move the cursor here:

    a|9c
    *^--- regex-engine cursor
    ^---- match for [a-z]

    So now (?<=9) will be evaluated (notice the position of the cursor |). (?<=subregex) checks whether the text immediately before the cursor can be matched by subregex. But here the cursor is directly after a, so the look-behind "sees" that a as the text its subexpression must test. Since a can't be matched by 9, the evaluation fails.
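
To make the point above concrete, here is a minimal sketch. The class name LookBehindAfterLetter and the helper findAll are hypothetical names introduced only for this illustration. It shows that a look-behind placed after [a-z] inspects the letter that was just consumed: requiring a 9 there never succeeds, while requiring a letter there always does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookBehindAfterLetter {
    // Hypothetical helper: returns every match as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        // The look-behind runs after the letter is consumed, so the character
        // directly before the cursor is that letter, never a 9.
        System.out.println(findAll("[a-z](?<=9)", "a9c"));     // prints []

        // Asking for a letter at the same spot succeeds for every letter.
        System.out.println(findAll("[a-z](?<=[a-z])", "a9c")); // prints [a@0, c@2]
    }
}
```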

  • Solution(s)

    You probably wanted to check whether 9 is placed before an acceptable letter. To achieve that you can modify your regex in several ways:

    • with [a-z](?<=9.) you make the look-behind test the two previous characters

      a9c|
       ^^
       9. - `9` matches 9, `.` matches any character (one directly before cursor)
      
    • or, more simply, use (?<=9)[a-z], which first checks for 9 directly behind the cursor and then looks for [a-z]; this lets the regex match c when the cursor is at 9|c.
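
As a quick check of both variants, the sketch below runs them against the input string from the question. The class name LookBehindFixes and the helper findAll are hypothetical and exist only for this example. Both patterns should report the same four letters, each one directly preceded by 9.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookBehindFixes {
    // Hypothetical helper: returns every match as "text@startIndex".
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "a9es m9x us9s w9es"; // input taken from the question

        // Look-behind first: the cursor must sit right after a 9, then a letter is consumed.
        System.out.println(findAll("(?<=9)[a-z]", input));  // prints [e@2, x@7, s@12, e@16]

        // Letter first: the look-behind then covers two characters, the 9 and the letter.
        System.out.println(findAll("[a-z](?<=9.)", input)); // prints [e@2, x@7, s@12, e@16]
    }
}
```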