
Why doesn't finite repetition in lookbehind work in some flavors?


I want to parse the 2 digits in the middle from a date in dd/mm/yy format but also allowing single digits for day and month.

This is what I came up with:

(?<=^[\d]{1,2}\/)[\d]{1,2}

I want a 1 or 2 digit number [\d]{1,2} with a 1 or 2 digit number and slash ^[\d]{1,2}\/ before it.

This doesn't work on many inputs; I have tested 10/10/10, 11/12/13, etc.

But to my surprise (?<=^\d\d\/)[\d]{1,2} worked.

But the [\d]{1,2} should also match if \d\d did, or am I wrong?


Solution

  • On lookbehind support

    Major regex flavors differ in their support for lookbehind; some impose certain restrictions, and some don't support it at all.

    • JavaScript: not supported before ES2018
    • Python: fixed length only
    • Java: finite length only
    • .NET: no restriction


    Python

    In Python, where only fixed length lookbehind is supported, your original pattern raises an error because \d{1,2} obviously does not have a fixed length. You can "fix" this by alternating on two different fixed-length lookbehinds, e.g. something like this:

    (?<=^\d\/)\d{1,2}|(?<=^\d\d\/)\d{1,2}
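
    As a quick check of the alternation above, here is a minimal Python sketch that runs it against the dates from the question. The variable name month is only illustrative, and the slash is left unescaped because it is not special in Python patterns.

```python
import re

# Each alternative pairs one fixed-width lookbehind with the month digits,
# so Python's fixed-width restriction is satisfied in both branches.
month = re.compile(r'(?<=^\d/)\d{1,2}|(?<=^\d\d/)\d{1,2}')

for date in ("10/10/10", "11/12/13", "1/23/45"):
    print(date, month.findall(date))
# 10/10/10 ['10']
# 11/12/13 ['12']
# 1/23/45 ['23']
```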
    

    Or perhaps you can put both lookbehinds as alternates of a non-capturing group:

    (?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}
    

    (Note that you can write \d on its own, without the surrounding brackets.)

    That said, it's probably much simpler to use a capturing group instead:

    ^\d{1,2}\/(\d{1,2})
    

    Note that findall returns what group 1 captures when the pattern has exactly one group. Capturing groups are more widely supported than lookbehind, and often lead to a more readable pattern (as in this case).

    This snippet illustrates all of the above points:

    import re

    p = re.compile(r'(?:(?<=^\d\/)|(?<=^\d\d\/))\d{1,2}')

    print(p.findall("12/34/56"))   # ['34']
    print(p.findall("1/23/45"))    # ['23']

    p = re.compile(r'^\d{1,2}\/(\d{1,2})')

    print(p.findall("12/34/56"))   # ['34']
    print(p.findall("1/23/45"))    # ['23']

    try:
        p = re.compile(r'(?<=^\d{1,2}\/)\d{1,2}')
    except re.error as e:
        print(e)   # look-behind requires fixed-width pattern
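
    The remark about findall and group 1 can be checked directly. The sketch below, with illustrative variable names, compares a pattern with no group, one group, and two groups on the same date.

```python
import re

date = "12/34/56"

# No capturing group: findall returns the whole matches.
whole = re.findall(r'^\d{1,2}/\d{1,2}', date)

# Exactly one capturing group: findall returns that group's text.
one_group = re.findall(r'^\d{1,2}/(\d{1,2})', date)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'^(\d{1,2})/(\d{1,2})', date)

print(whole)       # ['12/34']
print(one_group)   # ['34']
print(two_groups)  # [('12', '34')]
```

    If you need both the full match and the group, re.search or re.finditer gives you match objects instead of plain strings.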
    


    Java

    Java only requires that a lookbehind have a finite maximum length, so you can use \d{1,2} as in the original pattern. This is demonstrated by the following snippet:

        String text =
            "12/34/56 date\n" +
            "1/23/45 another date\n";
    
        Pattern p = Pattern.compile("(?m)(?<=^\\d{1,2}/)\\d{1,2}");
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        } // "34", "23"
    

    Note that (?m) is the embedded Pattern.MULTILINE so that ^ matches the start of every line. Note also that since \ is an escape character for string literals, you must write "\\" to get one backslash in Java.
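
    For comparison, Python accepts the same inline (?m) flag, but the lookbehind still has to be split into fixed-width alternatives. This is a sketch on the same two-line text, not a drop-in equivalent of the Java pattern; the variable names are illustrative.

```python
import re

text = "12/34/56 date\n1/23/45 another date\n"

# (?m) is the inline form of re.MULTILINE, so ^ matches at every line start.
# The two lookbehinds are each fixed-width, as Python's re module requires.
month = re.compile(r'(?m)(?:(?<=^\d/)|(?<=^\d\d/))\d{1,2}')

months = month.findall(text)
print(months)  # ['34', '23']
```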


    C-Sharp

    .NET places no restriction on what a lookbehind may contain, so C# accepts unbounded repetition there. The following snippet shows how you can use + repetition in a lookbehind:

    var text = @"
    1/23/45
    12/34/56
    123/45/67
    1234/56/78
    ";
    
    Regex r = new Regex(@"(?m)(?<=^\d+/)\d{1,2}");
    foreach (Match m in r.Matches(text)) {
      Console.WriteLine(m);
    } // "23", "34", "45", "56"
    

    Note that unlike Java, C# lets you use an @-quoted (verbatim) string so that you don't have to escape \.

    For completeness, here's how you'd use the capturing group option in C#:

    Regex r = new Regex(@"(?m)^\d+/(\d{1,2})");
    foreach (Match m in r.Matches(text)) {
      Console.WriteLine("Matched [" + m + "]; month = " + m.Groups[1]);
    }
    

    Given the previous text, this prints:

    Matched [1/23]; month = 23
    Matched [12/34]; month = 34
    Matched [123/45]; month = 45
    Matched [1234/56]; month = 56
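
    Python's re module cannot express the \d+ lookbehind, but the capturing-group form ports over directly. The following sketch assumes the same four-line text as the C# example and uses finditer to get both the full match and group 1; the names text, pattern and lines are illustrative.

```python
import re

text = "\n1/23/45\n12/34/56\n123/45/67\n1234/56/78\n"

# No lookbehind here: the unbounded \d+ prefix is matched normally and
# the month is taken from capturing group 1.
pattern = re.compile(r'(?m)^\d+/(\d{1,2})')

lines = [
    f"Matched [{m.group()}]; month = {m.group(1)}"
    for m in pattern.finditer(text)
]
print("\n".join(lines))
# Matched [1/23]; month = 23
# Matched [12/34]; month = 34
# Matched [123/45]; month = 45
# Matched [1234/56]; month = 56
```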
    
