
PLY/YACC parsing conflicts on PDF


I'm trying to parse PDFs with PLY lex/yacc, and I've hit a snag with the yacc parsing rules that govern NUMBER tokens, arrays, and indirect references.

The relevant source:

def p_value_list(t):
    r'''value_list : value_list value
                   | value'''
    if len(t) == 3:
        # value_list : value_list value
        t[0] = {'type': 'value_list', 'children': t[1]['children'] + [t[2]]}
    else:
        # value_list : value
        t[0] = {'type': 'value_list', 'children': [t[1]]}
def p_value(t):
    r'''value : dictionary
              | array
              | indirect_reference
              | NUMBER
              | HEX
              | STREAM
              | TEXT
              | BOOL
              | empty'''
    t[0] = t[1]

def p_indirect_reference(t):
    r'''indirect_reference : NUMBER NUMBER KEY_R'''
    t[0] = {'type': 'indirect_reference', 'children': [t[1], t[2]]}

def p_array(t):
    r'''array : LBRACKET value_list RBRACKET'''
    t[0] = {'type': 'array', 'children': t[2]}

This results in conflicts over NUMBER tokens (is NUMBER NUMBER a list of two numbers, or the start of an indirect reference NUMBER NUMBER KEY_R?). The errors I get range from unexpected NUMBER tokens when parsing the simple array [0 0 0]:

ERROR: Error  : obj_list NUMBER NUMBER OBJ LTLT key_value_list ID LBRACKET NUMBER NUMBER . LexToken(NUMBER,'0',1,166273)

to errors like this:

ERROR: Error  : obj_list NUMBER NUMBER OBJ LTLT key_value_list ID LBRACKET NUMBER NUMBER . LexToken(RBRACKET,']',1,88)

which I assume means the parser is expecting a KEY_R token when the input is really an array of two NUMBER tokens: [486 173].

For the brave, the PDF spec and my full source are linked below.

[1] https://github.com/opticaliqlusion/pypdf/blob/master/pypdf.py

[2] http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_1-7.pdf


Solution

  • It's not really an ambiguity; the grammar is perfectly unambiguous. However, it is not LR(1), and an LR(1) parser cannot figure out whether to shift or reduce an integer when it is followed by another integer. (The grammar is LR(2), since the second-next token is sufficient to decide: if it is R, the integer is the first token of an indirect reference and should be shifted.)

    Fixing the problem is a lot trickier. In theory, you can convert any LR(2) grammar to an LR(1) grammar, but the transformation is cumbersome and I don't know of any automated tool which does it. The basic idea is to extend productions so as to avoid the need for a reduction until enough context has been encountered, and then fix the parse tree as necessary.

    (For some other possible solutions to this problem, see below. The last option presented is my personal favourite.)

    Here's an illustration of the LR(2)→LR(1) technique; you can see how NUMBER tokens are simply accrued until their disposition is known. (The empty production hangs off value_list_ends_non_number rather than value_list, so that a list may also begin with numbers.)

    value_not_number
            : dictionary
            | array
            | HEX
            | STREAM
            | TEXT
            | BOOL
    value_list
            : value_list_ends_non_number
            | value_list_ends_one_number
            | value_list_ends_two_numbers
            | value_list_ends_indref
    value_list_ends_one_number
            : value_list_ends_non_number NUMBER
    value_list_ends_two_numbers
            : value_list_ends_non_number NUMBER NUMBER
            | value_list_ends_two_numbers NUMBER
    value_list_ends_indref
            : value_list_ends_two_numbers 'R'
    value_list_ends_non_number
            : value_list value_not_number
            |
    array   : '[' value_list ']'
    

    Note that the parse tree generated by the grammar is not quite accurate, since it will parse 0 0 R as a value_list_ends_two_numbers followed by an R. In order to retrieve the real parse tree, the reduction action for value_list_ends_indref needs to steal the last two numbers from its first child, use them to manufacture an indirect reference object, and then add that to the end of the first child. So the PLY grammar with actions might look like this:

    def p_unit_productions(t):
        r'''value_not_number
                : dictionary
                | array
                | HEX
                | STREAM
                | TEXT
                | BOOL
            value_list
                : value_list_ends_non_number
                | value_list_ends_one_number
                | value_list_ends_two_numbers
                | value_list_ends_indref'''
        t[0] = t[1]
    
    def p_value_list_empty(t):
        r'''value_list_ends_non_number :'''
        t[0] = []
    
    def p_value_list_push(t):
        r'''value_list_ends_one_number
                : value_list_ends_non_number NUMBER
            value_list_ends_two_numbers
                : value_list_ends_two_numbers NUMBER
            value_list_ends_non_number
                : value_list value_not_number'''
        t[0] = t[1]
        t[0].append(t[2])
    
    def p_value_list_push2(t):
        r'''value_list_ends_two_numbers
                : value_list_ends_non_number NUMBER NUMBER'''
        t[0] = t[1]
        t[0].append(t[2])
        t[0].append(t[3])
    
    def p_indirect_reference(t):
        r'''value_list_ends_indref
                : value_list_ends_two_numbers 'R' '''
        t[0] = t[1]
        gen = t[0].pop()
        obj = t[0].pop()
        t[0].append({'type': 'indirect_reference',
                     'children': [obj, gen] })
    
    def p_array(t):
        r'''array : '[' value_list ']' '''
        t[0] = {'type':'array', 'children' : t[2] }
    

    That's not quite the same as yours:

    • It doesn't bother with the value_list object type (which seemed to me unnecessary).
    • The grammar organization might seem a bit odd, since the functions I've defined are organized by action rather than by non-terminal, but PLY handles this fine (http://www.dabeaz.com/ply/ply.html#ply_nn25) and it leads to simpler actions.
    • I removed value: empty on the assumption that your empty is an empty non-terminal, as per the PLY manual; keeping it would create a real ambiguity, since there is no way to know how many empty strings occur between any two tokens.
    • I replaced the single-character literal tokens with single-quoted symbols (http://www.dabeaz.com/ply/ply.html#ply_nn26), because I personally believe that it makes grammars more readable.

    (I haven't tested that PLY fragment; let me know if there are problems. I could easily have made some typos.)

    Other possible solutions

    1. Use the lexical scanner

    As suggested in a comment, one alternative is to use the lexical scanner to recognize the entire sequence (in this case an indirect reference) which creates the need for extended lookahead.

    In effect, this uses the lexical scanner to implement a fall-back solution. This is a common solution when there are only a few LR(2) states in the grammar. For example, bison itself uses this technique to distinguish between the definition of a symbol on the left-hand side of a production and the use of a symbol on a right-hand side. (You can tell the difference by the presence or absence of a colon after the symbol, but when the symbol itself is the first lookahead token, the colon is the second -- one token further than an LR(1) parser can see.)

    In some cases that may seem like a simpler solution, but in other cases it results in simply exporting the complexity from the parser to the scanner. In this particular case, for example, you probably use the longest-match lexical disambiguation algorithm to distinguish between an R and an identifier token which starts with R, and you probably have a rule which ignores whitespace. Neither of those will help you with the rule to recognize indirect references; you need to explicitly match the whitespace, and you need to explicitly verify that whatever follows the R cannot form a longer identifier. That's not horribly complicated, but neither is it trivial; you'll need to do some work to create appropriate test coverage.
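
    For concreteness, here's a minimal, untested sketch of such a lexer rule in PLY; the token set is a stand-in for your real one. The rule has to match the intervening whitespace itself, and the trailing lookahead verifies that the R cannot be extended into a longer identifier:

    import ply.lex as lex

    tokens = ('INDIRECT_REF', 'NUMBER')
    literals = '[]'
    t_ignore = ' \t\r\n'

    # Function rules are tried in definition order, so this rule gets a
    # chance to match before t_NUMBER does.
    def t_INDIRECT_REF(t):
        r'\d+[ \t\r\n]+\d+[ \t\r\n]+R(?![A-Za-z0-9_])'
        # t.value is e.g. '1 0 R'; keep the two numbers.
        obj, gen = t.value.split()[:2]
        t.value = {'type': 'indirect_reference',
                   'children': [int(obj), int(gen)]}
        return t

    def t_NUMBER(t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input('[486 173] [1 0 R]')
    for tok in lexer:
        print(tok)   # the first pair lexes as two NUMBERs, the second as one INDIRECT_REF

    Note that t_ignore only applies between tokens, so the rule is still free to consume whitespace in the middle of a match.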

    As the number of possible extended-lookahead states increases, you'll find this solution increasingly complicated. (Not that the parser solution outlined above is great, either.)

    2. Use a GLR parser

    Since the grammar itself is not ambiguous, you could use a GLR parser instead of an LR(1) parser, if you had a parser generator which could produce GLR parsers. The GLR algorithm can parse any context-free grammar, but at a cost: it is slightly slower in common cases and much slower in edge cases (in some grammars, it could require quadratic or even cubic time), and typical implementations delay semantic actions until the ambiguity is resolved, which means that they don't necessarily execute in the expected order.

    In this case, there is a much stronger disincentive for this solution: PLY doesn't produce GLR parsers. However, bison does have a GLR option, and it is relatively simple to wrap a bison parser into a Python module (which will quite possibly be faster than the PLY parser, if that matters).

    3. Use a stack instead of a CFG

    PDF is derived from Postscript, which is a stack language. Stack languages generally don't require parsers; they have just a lexical scanner and a stack. Each token has an associated action; for many tokens (literals, for example), the action is simply to push the token onto the stack. Other tokens may have more complicated actions.

    The ] token, for example, creates an array by popping the stack until it finds a [ and placing the popped elements into a newly-created array object, which it then pushes on the stack.

    Similarly, the R token's action would be to pop two things off the top of the stack; check that they are numbers; and then construct an indirect reference object using the two numbers, which it would then push onto the stack.

    PLY won't help you much in the construction of a PDF stack parser, but it will at least build the lexer for you, which is actually the most complicated part of the parser. There are not many stack operations required, and none of them are particularly challenging.
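
    As a sketch (not tested), the dispatch loop might look like the following, assuming a PLY lexer that yields LBRACKET, RBRACKET, KEY_R and NUMBER tokens, with NUMBER values already converted to numbers; only the array and indirect-reference actions are shown:

    def stack_parse(lexer):
        mark = object()                     # sentinel pushed for every '['
        stack = []
        for tok in lexer:
            if tok.type == 'LBRACKET':
                stack.append(mark)
            elif tok.type == 'RBRACKET':
                # Pop until the matching '[' and bundle the items into an array.
                items = []
                while stack and stack[-1] is not mark:
                    items.append(stack.pop())
                if not stack:
                    raise SyntaxError("']' without a matching '['")
                stack.pop()                 # discard the sentinel
                items.reverse()
                stack.append({'type': 'array', 'children': items})
            elif tok.type == 'KEY_R':
                # R consumes the two numbers beneath it on the stack.
                if len(stack) < 2:
                    raise SyntaxError("R needs two numbers on the stack")
                gen, obj = stack.pop(), stack.pop()
                if not (isinstance(obj, int) and isinstance(gen, int)):
                    raise SyntaxError("R needs two numbers on the stack")
                stack.append({'type': 'indirect_reference',
                              'children': [obj, gen]})
            else:
                stack.append(tok.value)     # everything else is just pushed
        return stack

    A dictionary action would follow the same pattern as the array action, popping key/value pairs back to its opening delimiter.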