Search code examples
regexperloptimizationtrie

How does Perl's regex implementation makes use of tries?


I'm exploring ways to optimize a stack-based backtracking regex implementation (see also this thread). In the comments a hint about Perl's regex matcher was given. From the source code I already figured that it makes use of tries when matching a regex with several alternatives.

For example, a regex like /(?:foo|bar|baz|foobar)$/ would normally yield a program like

BRANCH
  EXACT <foo>
BRANCH
  EXACT <bar>
BRANCH
  EXACT <baz>
BRANCH
  EXACT <foobar>
EOL

while the optimized version using a trie looks somewhat like

TRIE-EXACT
  <foo>
  <bar>
  <baz>
  <foobar>
EOL

The four branches with an EXACT command each were optimized into a single TRIE command.

However, I don't understand how this trie is used in the execution phase with respect to backtracking.

Consider the non-optimized code and the input string foobar. When trying to match the regex, the first branch foo would be successfully matched, the $ would fail, the next branches bar and baz would fail, the last branch foobar would match, $ would match, and the matching completes successfully.

Now how does the matching work with the trie? From my understanding a trie has only one sensible implied order in which overlapping entries are tried, namely the longest match first. This would give a correct match in the aforementioned example.

But it would give the wrong result for /(?:foo|foobar)barr$/ and the input foobarr. The trie would match foobar, but fail on the following r because it expects a b next. So there must not only be a way to backtrack in the trie, but also the original order of branches must be preserved somehow, otherwise the optimized program would be semantically different from the non-optimized version.

How does the Perl implementation handle this tree-based matching?


Solution

  • I wrote the original implementation of this code. It has changed somewhat since then, which has slightly changed the time/space tradeoffs for how it works.

    First, I need to call out one misunderstanding in your post, the rules for Perls regex engine are not to match the longest string, but rather to match the leftmost-longest string, in particular

    "foal"=~/(f|fo|foa|foal)/ 
    

    should match "f" not "foal" as the "f" option is first. This is called the "word order" in the comments in the code. You can see this with Perl by injecting (?{ print $& }) into your patterns:

    perl -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/'
    f
    foa
    fo 
    

    Another point I thought I should explain is that the code and comments use the term "state" and "accepting state" a lot. This comes from the observation that a Trie is equivalent to an acyclic DFA, and can be naively represented as a table of one row per state, with one column per legal character in the input alphabet and with an additional column being used to track whether the state is "accepting", meaning "ends one of the alternatives", and if so which branch of the alternation it terminates. The values in the other columns represent the new state to transition to after reading the columns character. An accepting state may have transitions to more states, as in the case where one branch matches a prefix of another.

    Matching then comes down to walking the tree/state table, when we encounter an accepting state we push a 2-tuple of the number of characters we have read so far in the trie and the word number associated with the state into a list. Eventually we either run out of string to read, or encounter a transition to the 0 state which means we can stop as no further matches are possible.

    Once you have a list of the tuples of (length,word) you can then go through them in order of word ascending, since such lists are generally short IMO it is reasonable to simply use linear probing to find the minimum word each try. Perl uses a strategy where it precompiles all such possible lists into one structure in advance, so it need only keep track of the branch number of the most recent accepting state, and then can (somewhat expensively) compute any preceeding accepting states from that. For instance the example below /(f|foa|fo)al/ produces a structure like this:

    word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)
    

    Which shows that if we reach the accepting state for "foa" which is branch 2 we have matched a 3 character string and there is a preceding accepting state of branch 3, which in turn points at 1. From this we can reconstruct the behavior we would expect to see, which is to try "al" after "f", "foa" and then "fo". The process is somewhat expensive (N**2) compared to other solutions, but has the advantage of being precompiled and reusable and the storage requirements fit nicely with how the regex backtracking engine works. IMO it doesnt matter, the main issue is that somehow you track the accepting states where you were in the string when you encountered them, and then try the "tail" in the right order.

    You can see much of this at work with extended debugging:

    perl -Mre=Debug,TRIE,TRIEE,TRIEM -le'my $str="foal"; $str=~/(?:f|foa|fo)(?{print $&})al/' 
    Compiling REx "(?:f|foa|fo)(?{print $&})al"
      Looking for TRIE'able sequences. Tail node is: EVAL
      - BRANCH (1) -> EXACT <f> => EVAL (First==-1,Last==-1,Cur==1,tt==END,nt==EXACT,nnt==END)
      - BRANCH (4) -> EXACT <foa>   => EVAL (First==1,Last==-1,Cur==4,tt==EXACT,nt==EXACT,nnt==END)
      - BRANCH (7) -> EXACT <fo>    => EVAL (First==1,Last==4,Cur==7,tt==EXACT,nt==EXACT,nnt==END)
      - TAIL (10) <SCAN FINISHED>
        make_trie start==1, first==1, last==10, tail==11 depth=1
        TRIE(NATIVE): W:3 C:6 Uq:3 Min:1 Max:3
        Compiling trie using table compiler
          Char :    f   o   a
          State+-------------
             1 :    2   .   . (   1)
             2 :    .   3   . (   1) W   1
             3 :    .   .   4 (   1) W   3
             4 :    .   .   . (   0) W   2
    

    Here you can see the DFA/TRIE equivalency and the word data, and how the table is constructed. Because there are only 3 legal characters we build a table that has 4 columns, one per each legal character, and one to track if the state is accepting and if so which branch of the alternation it accepts. The W 2 on row 4 shows that it is accepting for the 2nd branch of the alternation.

        Alloc: 22 Orig: 13 elements, Final:3. Savings of %76.92
        Statecount:5 Lasttrans:4
          Char : Match Base  Ofs     f   o   a
          State|-----------------------------------
          #   1|       @   3 + 0[    2   .   .]
          #   2| W   1 @   3 + 1[    .   3   .]
          #   3| W   3 @   3 + 2[    .   .   4]
          #   4| W   2 @   0 
    

    This shows the naive table representation being packed into a compressed form that allows the zero transition in one row to be used to store non-zero transitions in other rows. Each transition is stored in an array of 2-tuple of [owner-state,transition], with two side arrays, one which stores the start position of a given state, and another that stores the accepting mappings for a given state, state transitions involve checking if table[start_offset[state]+column_ofs].owner_state is the same as state, and if not then table[start_offset[state]+column_ofs].new_state can be assumed to be 0. See the Red Dragon for more details.

    
        word_info N:(prev,len)= 1:(0,1) 2:(3,3) 3:(1,2)
    

    this part of the output shows how we have precomputed the length, and the predecessor branch number for every accepting state. This way we don't have to construct a list of accepting states as we match, but instead can precompute all the possible such lists at compile time into one static structure.

    Final program:
       1: EXACT <f> (3)
       3: TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o] (11)
          <> 
          <oa> 
          <o> 
    

    Here you can see an interesting optimization kick in, because all of the alternations start with the letter "f" we "remove" it from the trie, and turn it into a prefix, the EXACT , followed by a trie containing the empty string, the letters "oa" and the letter "o". The S:2/4 means we start on state 2 not on the normal state 1, which is the state that corresponds to the letter 'f'. Extracting the 'f' like this allows it to be used elsewhere in the matching process, which means we can use less expensive machinery the Trie to find possible matches.

      11: EVAL (13)
      13: EXACT <al> (15)
      15: END (0)
    anchored "f" at 0 floating "al" at 1..3 (checking floating) minlen 3 with eval 
    Guessing start of match in sv for REx "(?:f|foa|fo)(?{print $&})al" against "foal"
    Found floating substr "al" at offset 2...
    Found anchored substr "f" at offset 0...
    

    The 'f' that was extracted earlier is being used to find the first possible place in the string the pattern could match.

    Guessed: match at offset 0
    Matching REx "(?:f|foa|fo)(?{print $&})al" against "foal"
       0 <> <foal>               |  1:EXACT <f>(3)
       1 <f> <oal>               |  3:TRIE-EXACT<S:2/4 W:3 L:0/2 C:6/3>[o](11)
       1 <f> <oal>               |    State:    2 Accepted: Y Charid:  2 CP:  6f After State:    3
       2 <fo> <al>               |    State:    3 Accepted: Y Charid:  3 CP:  61 After State:    4
       3 <foa> <l>               |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
    

    Here we can see the walk through tree starting at state 2. Each "Accepted" line means we have found a possible match. The last line shows us reading a character not in our alphabet, meaning a transition to state 0 which means we can stop. Unfortunately this does not show the word numbers associated with each match, but you can infer them from the state transition table: 1,3,2, but we need to "try" them 1,2,3.

                                      got 3 possible matches
                                      TRIE matched word #1, continuing
       1 <f> <oal>               | 11:  EVAL(13)
    f
       1 <f> <oal>               | 13:  EXACT <al>(15)
                                        failed...
                                      TRIE matched word #2, continuing
       3 <foa> <l>               | 11:  EVAL(13)
    foa
       3 <foa> <l>               | 13:  EXACT <al>(15)
                                        failed...
                                      TRIE matched word #3, continuing
                                      only one match left, short-circuiting: #3 <o>
       2 <fo> <al>               | 11:EVAL(13)
    fo
       2 <fo> <al>               | 13:EXACT <al>(15)
       4 <foal> <>               | 15:END(0)
    Match successful!
    
    

    Note an explanation of different aspects of the Trie compilation process can be found at:

    https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l2407

    https://perl5.git.perl.org/perl.git/blob/HEAD:/regcomp.c#l4726