Search code examples
algorithmparsingpegpackrat

Packrat caching : Right to left vs. Left to right?


I'm currently trying to familiarize myself with packrat parsing. So I've read the PDF paper from 2002 linked here and in section 2.3 it describes packrat caching as a preliminary process (which occurs before the actual parsing) in which a full caching table is pre-constructed by reading the input from right to left. Only then, the actual linear parsing from left to right can start.

But in every PEG parser implementation I found, the "cache" option is usually a caching process that occurs during the actual left to right parsing. For example here.

Is there any difference between both approaches? Thank you.


Solution

  • I recently worked on similar research, met the exact same confusion, and resolved it. Regardless if you are still working on this topic, here's my answer.

    Your understanding is correct:

    • Packrat parser scans input string from left to right
    • Packrat parser construct the cache from right to left

    But there's just one approach, not two. Let's use one simple example Parsing Expression Grammar (PEG) without left-recursion: E -> Num + E | Num

    (Note that, a left-recursion example requires another long explanation, you can refer CPython's implementation for details)

    The Syntax Directed Translation (SDT) will be something like:

    E -> a=Num + b=E { a + b }
    E -> Num { Num }
    

    And we can write a parse_E function in below:

    def parse_E(idx):
        if idx in cache['parse_E']:
            return cache['parse_E'][idx]
    
        lval, nidx = parse_Char(idx)
        if nidx < len(self.tokens):
            operator, nnidx = parse_Char(nidx)
            if operator == '+':
                # E -> Num + E
                rval, nnnidx = parse_E(nnidx)
                cache['parse_E'][idx] = lval + rval, nnnidx
                return cache['parse_E'][idx]
        
        # E -> Num
        cache['parse_E'][idx] = lval, nidx
        return cache['parse_E'][idx]
    

    According to Byran Ford's paper, the parser needs to scan the input string from left to right and construct the cache in any position:

    for idx in len(input_string):
        parse_E(idx)
        parse_Char(idx)
    

    So, let's check the cache construction under the hood, initially, we have an empty cache and input string:

    cache: {'parse_E': {}, 'parse_Char': {}}
    input string: `2 + 3 + 4`
    

    The function call happens in the following order when idx=0. Clearly, we construct the cache from right to left at position 0 (not even to mention idx=1 or above).

    • parse_Char(Y) happens earlier than parse_Char(X) (Y > X)
    • parse_Char(X) must happens earlier than parse_E(X)
       parse_E(0)     ---   (E -> Num + E) (pending)
    -> parse_Char(0)  --- 2 (pending)
    -> parse_Char(1)  --- + (pending)
    -> parse_E(2)     --- E (E -> Num + E) (pending)
    -> parse_Char(2)  --- 3 (pending)       
    -> parse_Char(3)  --- + (pending)
    -> parse_E(4)     --- E (E -> Num) (pending)
    -> parse_Char(4)  --- 4 (acc)
    
    # Only after parse_Char(4) succeed and fill into cache, parse_E(4) can be successful...and so on.
    

    If you want to read the full Python example of Packrat parser implementation, you can check my repository. It contains a handmade Packrat parser and a CPython PEG generated Packrat parser based on a simple meta grammar.