Search code examples
antlrantlr3abstract-syntax-treeantlrworks

Resolving possible ANTLR grammar ambiguities (And general improvement tips)


I'm having an issue building a grammar which is capable of parsing Python 3's AST dump format and converting it into an AST format that's easier for me to play around with. I decided to write an ANTLR grammar to do so, but I'm having an issue processing keyword blocks (but only keyword blocks, for some reason). I isolated out the keyword grammar, as shown:

grammar kwds;
options {output=AST;}

keywords:   'keywords=['((', '?)keyword)*']' -> keyword*
    ;

keyword :   'keyword(arg='STRING', value='str')'
    ;
str :   'Str(s='STRING')' -> STRING
    ;

STRING
    :  '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
    ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

EMPTYBRACKETS
    :   '[]';

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

This is designed to accept a list of keywords (0 or more with comma separators), with the format shown in the keyword rule.

If you give the above grammar the following (valid) input,

keywords=[keyword(arg='name', value=Str(s='UGA')), keyword(arg='rank', value=Str(s='2'))]

the grammar will recognize this, as it should.

However, using the "complete" python 3 AST format grammar that I've written (found at http://pastebin.com/ETrSVXvf for space-saving purposes, with the two above rules found on lines 106 and 109, respectively), which uses virtually the exact same grammar rules, the token stream seems to be a few characters off after parsing the first keyword match from the sample shown above, producing the following output when parsing against the keywords rule:

sample3.txt line 1:52 mismatched character 'e' expecting 'w'
sample3.txt line 1:53 no viable alternative at character 'y'
sample3.txt line 1:54 no viable alternative at character 'w'
sample3.txt line 1:55 no viable alternative at character 'o'
sample3.txt line 1:56 no viable alternative at character 'r'
sample3.txt line 1:57 no viable alternative at character 'd'
sample3.txt line 1:58 no viable alternative at character '('
sample3.txt line 1:59 missing ENDBR at 'arg='

I can only think of one possibility for this occurring: Something is being tokenized incorrectly due to an ambiguity in the grammar, since the pattern I used to detect multiple keyword statements works for other types of statements. However, I'm completely stuck as to where that ambiguity actually is in the grammar.

Also, any general improvement tips as to how I could improve my grammar in general would be appreciated!


Solution

  • If you add the rule:

    parse
     : (t=. {System.out.printf("type=\%-20s text='\%s'\n", tokenNames[$t.type], $t.text);})* EOF
     ;
    

    which simply matches zero or more tokens and prints out the type and text of these tokens, you will see that the lexer cannot cope with the input , keyword from your example:

    keywords=[keyword(arg='name', value=Str(s='UGA')), keyword(arg='rank', value=Str(s='2'))]
                                                     ^^^^^^^^^
    

    So there is no issue in one of your parser rules, but things go wrong on a lexical level.

    I advise you to remove all literal tokens from your parser and create lexer rule for them. Then add a single parse rule as I posted above with which you can test the lexer to see if the proper tokens are created. Once the proper tokens have been created, write your parser rules.

    I'm pretty sure the problem here is the fact that you don't have a ', keyword' token and that, once the lexer "sees" ', k', it tries to create a ', kwargs' token which fails, of course. So I also recommend you do not include the comma's and spaces in your tokens but let them be tokens of their own.

    Also, you don't want to have a rewrite rule like this:

    stmtlist:       ((', '?) stmt)* -> stmt*
            ;
    

    which can potentially match nothing. If that happens, ANTLR will throw an exception while creating the AST. Always let a rewrite rule produce something:

    ...
    
    tokens {
      ...
      STMTLST;
      ...
    }
    
    ...
    
    stmtlist:       ((', '?) stmt)* -> ^(STMTLST stmt*)
            ;