parsing python-3.x context-free-grammar ply python-plyplus

Plyplus gives syntax error because of specific keywords?

I am using plyplus to design a simple grammar and I have been struggling with some weird error for a while. Please bear in mind I am a newbie. Here is a piece of code that reproduces the issue:

from plyplus import Grammar

list_parser = Grammar("""
    start: context* ;
    context : WORD '{' (rule)* '}' ;
    rule: 'require' space_marker ;
    space_marker: 'newline'
        | 'tab'
        | 'space'
        ;

    WORD: '\w+' ;
    SPACES: '[ \t\n]+' (%ignore) ;
    """, auto_filter_tokens=False)

res = list_parser.parse("test { require tab }")

If my input string contains require space or require newline, it works perfectly fine. However, as soon as I provide require tab, an exception is thrown:

Traceback (most recent call last):
  File "/Users/bore/Projects/ThesisCode/CssCoco/coco/plytest.py", line 18, in <module>
    res = list_parser.parse("test { require tab }")
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/plyplus/plyplus.py", line 584, in parse
    return self._grammar.parse(text)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/plyplus/plyplus.py", line 668, in parse
    raise ParseError('\n'.join(self.errors))
plyplus.plyplus.ParseError: Syntax error in input at 'tab' (type WORD) line 1 col 16

Strangely, I do not get this exception every time I run the code, but about once in every three runs. I noticed that if I change the keyword from tab to ta in both the grammar and the input, I get the same exception on every run. If I change it to tabb instead, the error is gone.

The error suggests that tab is parsed as a WORD instead of a space_marker. However, tabb is also a WORD. From my trial and error it seems that plyplus is sensitive to the specific string I provide as a keyword. Am I missing something? Any help/hints/comments will be highly appreciated!

Solution

PlyPlus is built on top of PLY, where L and Y stand for Lex and Yacc, so it is (for better or worse, probably worse) an LR parser, which works strictly bottom-up on the tokens its lexer hands over. This also means 'tab' cannot be parsed as TAB (or _ANON_X, or whatever name it generates for the token) once the lexer has matched it with your very generous definition of WORD, which is exactly what the error message reports: 'tab' (type WORD). The only way around it is to make the definition more restrictive. For instance:

WORD: '\w+' (%unless
    TAB: 'tab';
    REQ: 'require';
  );
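To illustrate what such a keyword exception does, here is a minimal sketch in plain Python that does not use PlyPlus at all. It assumes the idea behind %unless is "match the generic pattern first, then reclassify the lexeme if it is a listed keyword". The names KEYWORDS, TOKEN_RE and tokenize are made up for this example and are not part of PlyPlus.

```python
import re

# Hypothetical keyword table, mirroring the TAB and REQ entries above.
KEYWORDS = {"tab": "TAB", "require": "REQ"}

# Whitespace, words, or a single brace. Anything else is silently
# skipped by finditer, which is good enough for a sketch.
TOKEN_RE = re.compile(r"\s+|\w+|[{}]")

def tokenize(text):
    """Return (token_type, lexeme) pairs for the given input text."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        lexeme = match.group()
        if lexeme.isspace():
            continue  # plays the role of the ignored SPACES token
        if lexeme in "{}":
            tokens.append((lexeme, lexeme))
        else:
            # Every \w+ match is a WORD unless it is a listed keyword.
            tokens.append((KEYWORDS.get(lexeme, "WORD"), lexeme))
    return tokens

print(tokenize("test { require tab }"))
```

With this scheme the keyword check happens after the match, so the result no longer depends on the order in which the patterns are tried.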

My guess is that it works for 'newline' and 'space' because there is an implicitly defined preterminal somewhere which is assigned a higher priority than WORD. The PlyPlus documentation does not cover this, so to confirm it one would have to look at the actual implementation of PlyPlus's tokeniser.
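One observation that may support this guess: the PLY documentation states that tokens defined as plain strings are added to the lexer in order of decreasing regular-expression length. The pattern '\w+' is three characters long, so 'newline', 'space' and 'tabb' would be tried before it, 'ta' after it, and 'tab' would tie with it. That would match the always/never/sometimes behaviour described in the question. Whether PlyPlus really passes its anonymous keyword tokens to PLY in this form is an assumption and has not been checked against its source. The sketch below only imitates the ordering rule with the standard re module; build_classifier and the token name KW are hypothetical.

```python
import re

def build_classifier(patterns):
    """Build a function that names the token type of a lexeme.

    patterns is a list of (name, regex) pairs. They are ordered by
    decreasing regex length, imitating PLY's documented rule for
    string-defined tokens. sorted() is stable, so patterns of equal
    length keep the order in which they were passed in.
    """
    ordered = sorted(patterns, key=lambda p: len(p[1]), reverse=True)
    master = re.compile(
        "|".join("(?P<%s>%s)" % (name, regex) for name, regex in ordered)
    )

    def classify(lexeme):
        # The first alternative that matches at position 0 wins.
        return master.match(lexeme).lastgroup

    return classify

WORD = ("WORD", r"\w+")  # the regex is 3 characters long

for keyword in ("newline", "space", "tabb", "ta"):
    classify = build_classifier([WORD, ("KW", keyword)])
    print(keyword, "->", classify(keyword))

# 'tab' is also 3 characters long, so the input order decides.
print("tab ->", build_classifier([WORD, ("KW", "tab")])("tab"))
print("tab ->", build_classifier([("KW", "tab"), WORD])("tab"))
```

If the tie for 'tab' is broken by something that varies between runs, such as the iteration order of a hash-based container, that would also explain why the error only appears some of the time.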