I'm trying to write a grammar with PLY that will parse paths in a file. I'm running into shift reduce conflicts and I'm not sure how to change the grammar to fix it. Here's an example of the file I'm trying to parse. The path/filename can be any acceptable linux path.
file : ../../dir/filename.txt
file : filename.txt
file : filename
So here is the grammar that I wrote.
header : ID COLON path
path : pathexpr filename
pathexpr : PERIOD PERIOD DIVIDE pathexpr
| PERIOD DIVIDE pathexpr
| ID DIVIDE pathexpr
|
filename : ID PERIOD ID
| ID
Here are my tokens. I am using the PLY included ctokens library. Just to save effort in writing my own.
t_ID = r'[A-Za-z_][A-Za-z0-9_]*'
t_PERIOD = r'\.'
t_DIVIDE = r'/'
t_COLON = r':'
So I believe there is a shift reduce conflict in the "filename" rule because the parser doesn't know whether to reduce the token to "ID" or to shift for "ID PERIOD ID". I think there is another issue with the case of no path ("filename") where it will consume the token in pathexpr instead of reducing to empty.
How can I fix my grammar to handle these cases? Maybe I need to change my tokens?
The simple solution: Use left-recursion instead of right-recursion.
LR parsers (like PLY and yacc) prefer left-recursion, because it avoids having to expand the parser stack. It is also usually closer to the semantics of the expression -- which is useful when you want to actually interpret the language, and not just recognize it -- and it often, as in this case, avoids the need to left-factor.
In this case, for example, each path segment needs to be applied to the preceding pathexpr
, by looking for the segment directory inside the currently found directory. The parser action is clear: look up $2 in $1. How do you right the action for the right recursive version?
So, a simple transformation:
header : ID COLON path
path : pathexpr filename
pathexpr : pathexpr PERIOD PERIOD DIVIDE
| pathexpr PERIOD DIVIDE
| pathexpr ID DIVIDE
|
filename : ID PERIOD ID
| ID