Search code examples
pythonantlrcontext-free-grammarbnf

BNF Grammar for python style structures


I am trying to a simple grammar for parsing python like structures, this is what I could come up with for a list/set

list : '[' atom ( ',' atom)* ']'
set : '(' atom ( ',' atom)* ']'

atom : 'a'..'z' | 'A'..'Z'
     | '[' list ']'
     | '(' set ')'

Note that this is in antlr, I wanted to know about its correctness and any resources that would help me out

I did look at the python' grammar http://docs.python.org/reference/grammar.html but couldn't quite figure out it was handling list of lists or set of lists or list of sets etc..

Any help would appreciated.


Solution

  • couldn't quite figure out it was handling list of lists or set of lists or list of sets etc..

    It doesn't distinguish lists from sets or whatever:

    atom: ('(' [yield_expr|testlist_comp] ')' |
           '[' [listmaker] ']' |
           '{' [dictorsetmaker] '}' |
           '`' testlist1 '`' |
           NAME | NUMBER | STRING+)
    

    The way they handle recursion of the sort you're describing is that listmaker, dictorsetmaker etc. ultimately may contain atom. For example:

    listmaker: test ( list_for | (',' test)* [','] )
    test: or_test ['if' or_test 'else' test] | lambdef
    or_test: and_test ('or' and_test)*
    and_test: not_test ('and' not_test)*
    not_test: 'not' not_test | comparison
    comparison: expr (comp_op expr)*
    expr: xor_expr ('|' xor_expr)*
    xor_expr: and_expr ('^' and_expr)*
    and_expr: shift_expr ('&' shift_expr)*
    shift_expr: arith_expr (('<<'|'>>') arith_expr)*
    arith_expr: term (('+'|'-') term)*
    term: factor (('*'|'/'|'%'|'//') factor)*
    factor: ('+'|'-'|'~') factor | power
    power: atom trailer* ['**' factor]
    

    There are a lot of intermediates; that's because they need to establish precedence for a bunch of mathematical operators. Then there list_for, which allows for adding the extra stuff for a list comprehension.

    A much more simplified example might look like:

    atom: ('[' [list_or_set] ']' |
           '{' [list_or_set] '}' |
           NAME | NUMBER | STRING+)
    
    list_or_set: atom (',' atom)* [',']
    

    Or if you want the distinction between lists and sets to be made at this level:

    atom: list | set | NAME | NUMBER | STRING+
    list: '[' atom (',' atom)* [','] ']'
    set: '{' atom (',' atom)* [','] '}'