Search code examples
parsinggrammaryaccplyshift-reduce-conflict

how to write a PLY grammar to parse paths?


I'm trying to write a grammar with PLY that will parse paths in a file. I'm running into shift reduce conflicts and I'm not sure how to change the grammar to fix it. Here's an example of the file I'm trying to parse. The path/filename can be any acceptable linux path.

file : ../../dir/filename.txt
file : filename.txt
file : filename

So here is the grammar that I wrote.

header : ID COLON path

path : pathexpr filename

pathexpr : PERIOD PERIOD DIVIDE pathexpr
           | PERIOD DIVIDE pathexpr
           | ID DIVIDE pathexpr 
           |
filename : ID PERIOD ID
           | ID               

Here are my tokens. I am using the PLY included ctokens library. Just to save effort in writing my own.

t_ID = r'[A-Za-z_][A-Za-z0-9_]*'
t_PERIOD = r'\.'
t_DIVIDE = r'/'
t_COLON = r':'

So I believe there is a shift reduce conflict in the "filename" rule because the parser doesn't know whether to reduce the token to "ID" or to shift for "ID PERIOD ID". I think there is another issue with the case of no path ("filename") where it will consume the token in pathexpr instead of reducing to empty.

How can I fix my grammar to handle these cases? Maybe I need to change my tokens?


Solution

  • The simple solution: Use left-recursion instead of right-recursion.

    LR parsers (like PLY and yacc) prefer left-recursion, because it avoids having to expand the parser stack. It is also usually closer to the semantics of the expression -- which is useful when you want to actually interpret the language, and not just recognize it -- and it often, as in this case, avoids the need to left-factor.

    In this case, for example, each path segment needs to be applied to the preceding pathexpr, by looking for the segment directory inside the currently found directory. The parser action is clear: look up $2 in $1. How do you right the action for the right recursive version?

    So, a simple transformation:

    header   : ID COLON path
    
    path     : pathexpr filename
    
    pathexpr : pathexpr PERIOD PERIOD DIVIDE
             | pathexpr PERIOD DIVIDE
             | pathexpr ID DIVIDE
             |
    filename : ID PERIOD ID
             | ID