Search code examples
parsinggrammarcontext-free-grammartext-parsing

Manually parsing a simple grammar


I need to implement a simple parser which would parse the following string: a[b[c,d],e],f[g,h[i], j[k,l]],...

In the example above there is a GROUP of one or more OBJECTS. Each object has nested OBJECTS(b in a AND h,j in f) plus nested VALUES(c,d,e,g,i,k,l).

So, it's something like:

GROUP : OBJECT (,OBJECT)*
OBJECT: NAME '[' OBJECTorVALUE (OBJECTorVALUE)* ']'
OBJECTorVALUE: OBJECT | VALUE
VALUE : NAME

What will be the easiest way to parse such grammar manually?

I tried to parse it with recursive descent parser and it's doable but looks weird because you have to resolve the production for OBJECTorVALUE:

OBJECTorVALUE: NAME '[' blabla ']'
OBJECTorVALUE: NAME

So, to make it LL grammar (to be parsed by rec-descent) we should resolve it to

OBJECTorVALUE: NAME Z
Z: eps | '[' blabla ']'

and now rec-desc parser gets some unnecessary complexity and feels unnatural, so I have to introduce one of rec-desc methods to look-ahead for Z after the name. So, instead of easy-peasy methods

parseGroup -> parseObjects 
parseObj -> parseName consume[ parseObjectValueGroup consume ]
parseObjectValueGroup -> parseObjectValues
parseObjectValue -> look('[') -> parseObj OR parseValue

I get the same but the last statement becomes

parseObjectValue -> parseName parseZ
parseZ -> look('[') -> parseObjWithoutName OR return empty
parseObjWithoutName -> ...

That looks messy to me, so can we use anything more simple here? I mean it's damn simple grammar which can be parsed even by string splits and looks as easy as it can be. Maybe Simple LR parser (SLR) will be more readable in the case?


Solution

  • I actually think trying to parse this top-down isn't going to be so bad. If you try writing this out as a formal grammar it looks a lot worse than it actually is. For example, here's what this might look like in pseudocode:

    ParseGroup():
        result = empty list
        while true:
            read token 'next'; it's some identifier.
            if (there's no next token) {
               append 'next' to result.
               break;
            }
            else if (the next token is '[') {
               let inner = ParseGroup();
               read token; it's a ']';
               append 'next[inner]' to result.
            } else {
               append 'next' to result.
            }
            read token; it's a comma.
            append ',' to result.
        }
    }
    

    This is basically a slightly-modified LL(1) parser for the modified grammar.