Search code examples
regexpcrebacktracking

RegEx debugging


I'm debugging a Regular Expression ^(A+)*B over a string AAAC (example from rexegg.com) by two separate debugging tools I have access:

  1. regex101.com
  2. RegexBuddy v4

Below is the results (regex101 in the left side):

tool outps (regex101 on the left)

The question I have is not mainly about the number of steps which is also important to me, but is how backtracks are done. Why do we see differences? (regex101 uses PCRE lib and I set RegexBuddy lib the same)

A comprehensive step by step explanation is really in my favor.


Solution

  • I wouldn't rely on either the number of steps or any debugging as a test
    of either backtracking or failure.

    Generally, keep away from simple constructs such as (A+)* that not only
    backtrack the inner A+ but backtrack the outter (..)* as well.
    Each pass of the outter produces a fresh (new) inner series of backtracking.

    Get a good benchmark software like RegexFormat
    and test a series for an exponential time result.
    A linear result is what you are looking for as the content increases by the same
    amount.

    Below is an example of (A+)? vs (A+)*. We set the target to a known failure
    and steadily increase the length to extend the processing of that failure.

    If you look at regex 2 (A+)* you can notice the exponential increase in just
    three target increases.
    Finally, we blow out the target to overload the internal resources of the engine.


    Edit: After some analysis, I post a meager conclusion on backtracking steps.
    By analysis of the benchmark time's below, backtracking appears to be a 2^N process.
    Where N is the number of character literals that are backtracked on failure.

    Given Revo's simple case, it's a bit easier to isolate the backtracking.
    Because it looks like %98 of the total time taken involves just backtracking.

    Given that assumption, one can take the time results from 2 points, and generate an equation.

    The form of the equation I used was f(x) = a * r^x.
    Given coordinates (# of 'A's vs. Time taken), using points
    (7, 1.13) , (9, 4.43) which generated this f(x) = 0.009475 * 1.9799 ^ x
    which is really this # sec's = 0.009475 * 1.9799 ^ # letters
    I ran all the number of letter 'A's from the benchmark's below into this formula.

    #LTR's   Bench Time
     7         1.13
     9         4.43
    13        70.51
    

    which produced the exact benchmark time ( +/- .1% ).

    I then saw that 1.9799 is pretty close to 2.0, so I adjusted the 'a' factor down to .009 and ran it again, getting this:

    f(7 letters) = 2 ^ 7 * .009 =  1.152 sec’s
    f(9 letters) = 2 ^ 9 * .009 =  4.608 sec’s
    f(13 letters) = 2 ^ 13 * .009 =  73.728 sec’s
    f(27 letters) = 2 ^ 27 * .009 =  1207959.552 sec’s = 335 hours  
    

    It seems pretty clear now that backtracking steps correlate to the number of
    letters to backtrack in a 2 ^ N relationship.

    I also think it's a fair bet that some engines can do this simple 2^N math
    only in simple scenario's like this one. These seem to be the times where
    the engine comes back immediately and says Too complex!.
    The translation here is that the regex is simple enough that the engine can
    detect it. Other times, the engine never comes back,
    it's hung, and may come back in a year or two (or ten..).

    Conclusion therefore is not if the engine will backtrack, it will, but how
    will your target string affect the backtracking.

    So, vigilance is required when writing regex, and must be on guard against
    nested open ended quantifiers.

    I'd say the best bet is always to hit a regex real hard to get it to fail.
    And I'm talking about excessive repetitive literals in suspect places.
    end edit


    Benchmark App

    Target: AAAAAAAC

    Benchmark

    Regex1:   ^(A+)?B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    0.07 s,   72.04 ms,   72040 µs
    
    
    Regex2:   ^(A+)*B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    1.13 s,   1126.44 ms,   1126444 µs
    

    Target: AAAAAAAAAC

    Benchmark

    Regex1:   ^(A+)?B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    0.08 s,   82.43 ms,   82426 µs
    
    
    Regex2:   ^(A+)*B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    4.43 s,   4433.19 ms,   4433188 µs
    

    Target: AAAAAAAAAAAAAC

    Benchmark

    Regex1:   ^(A+)?B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    0.10 s,   104.02 ms,   104023 µs
    
    
    Regex2:   ^(A+)*B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    70.51 s,   70509.32 ms,   70509321 µs
    

    Target: AAAAAAAAAAAAAAAAAAAAAAAAAAAC

    Benchmark

    Regex1:   ^(A+)?B
    Options:  < none >
    Completed iterations:   50  /  50     ( x 1000 )
    Matches found per iteration:   0
    Elapsed Time:    0.18 s,   184.05 ms,   184047 µs
    
    
    Regex2:   ^(A+)*B
    Options:  < none >
    Error:  Regex Runtime Error !!   
    Completed iterations:   0  /  50     ( x 1000 )
    Matches found per iteration:   -1
    Elapsed Time:    0.01 s,   7.38 ms,   7384 µs
    
    
    Error item 2 :  Target Operation .. 
    
    The complexity of matching the regular expression exceeded predefined
    bounds. Try refactoring the regular expression to make each choice made by
    the state machine unambiguous. This exception is thrown to prevent
    "eternal" matches that take an indefinite period time to locate.