I am using plyplus to design a simple grammar and I have been struggling with some weird error for a while. Please bear in mind I am a newbie. Here is a piece of code that reproduces the issue:
from plyplus import Grammar
list_parser = Grammar("""
start: context* ;
context : WORD '{' (rule)* '}' ;
rule: 'require' space_marker ;
space_marker: 'newline'
| 'tab'
| 'space'
;
WORD: '\w+' ;
SPACES: '[ \t\n]+' (%ignore) ;
""", auto_filter_tokens=False)
res = list_parser.parse("test { require tab }")
If my input string contains require space or require newline, it works perfectly fine. However, as soon as I provide require tab, an exception is thrown:
Traceback (most recent call last):
  File "/Users/bore/Projects/ThesisCode/CssCoco/coco/plytest.py", line 18, in <module>
    res = list_parser.parse("test { require tab }")
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/plyplus/plyplus.py", line 584, in parse
    return self._grammar.parse(text)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/plyplus/plyplus.py", line 668, in parse
    raise ParseError('\n'.join(self.errors))
plyplus.plyplus.ParseError: Syntax error in input at 'tab' (type WORD) line 1 col 16
Ironically, I do not get this exception every time I run the code, but exactly once in three times. I noticed that if I change the grammar and the input from tab to ta, I get the same exception every time I run the code. Also, if I change it to tabb, the error is gone.
The error suggests that tab is parsed as a WORD instead of a space_marker. However, tabb is also a WORD. From my trial and error it seems that plyplus is sensitive to the specific string I provide as a keyword. Am I missing something? Any help/hints/comments will be highly appreciated!
PlyPlus is built on top of PLY, where L and Y stand for Lex and Yacc, so it is, for better or worse (probably worse), an LR parser, which works strictly bottom-up. The tokeniser runs before the parser and knows nothing about which tokens the grammar expects at a given point. This means 'tab' cannot reliably be tokenised as TAB (or _ANON_X, or whatever name it generates for the token), because your very generous definition of WORD matches it as well. The only way around it is to make the definition more restrictive. For instance:
WORD: '\w+' (%unless
    TAB: 'tab';
    REQ: 'require';
    );
My guess is that it works for 'newline' and 'space' because there is an implicitly defined preterminal somewhere which gets a higher priority assigned than the WORD, but the documentation of PlyPlus is not exactly top class either, so one would have to look at the actual implementation of PlyPlus's tokeniser.
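To illustrate the idea behind %unless without depending on PlyPlus internals, here is a rough standalone sketch in plain Python. It is not how PlyPlus's tokeniser is actually written; the names tokenize, TOKEN_RE and KEYWORDS are made up for this example, and PUNCT is just a stand-in for the brace tokens. It only shows the general technique: match everything with the broad \w+ rule first, then re-tag the match if it is one of the reserved strings.

```python
import re

# Hypothetical keyword table, playing the role of the %unless block.
KEYWORDS = {"require", "tab", "space", "newline"}

# One broad WORD rule, the two braces, and whitespace (which is skipped).
TOKEN_RE = re.compile(r"(?P<WORD>\w+)|(?P<PUNCT>[{}])|(?P<SPACES>[ \t\n]+)")


def tokenize(text, keywords=frozenset()):
    """Return (type, value) pairs; WORD matches found in `keywords` are re-tagged."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, value = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SPACES":
            continue  # same effect as (%ignore) in the grammar
        if kind == "WORD" and value in keywords:
            kind = value.upper()  # e.g. 'tab' becomes a TAB token
        tokens.append((kind, value))
    return tokens


# Without the keyword table, 'tab' is swallowed by the generic WORD rule.
naive = tokenize("test { require tab }")
# With it, 'require' and 'tab' get their own token types.
fixed = tokenize("test { require tab }", KEYWORDS)
print(naive)
print(fixed)
```

Note that the lookup compares the whole match, so tabb stays a WORD while tab does not. That is the behaviour you want from the %unless version of the grammar.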