Search code examples
pythonpyparsing

How to acquire Abstract Syntax Tree from ParseResults


I'm developing a translator for translating simple script on PC to some bytecode to execute it (the bytecode) on a microcontroller.

I've developed the translator in C++ using lex and re2c but Im considering switching to pyparsing.

In order to translate a statement of my script to few operations in bytecode I need to get the Abstract Syntax Tree of that statement.

I.E. this script:

X = 1 - 2;

Should be translated to binary equivalent of this:

register1 <- 1
register2 <- 2
register3 <- register1 - register2
x <- register3

I've got this python code:

integer = Combine( number )
ident = Word(alphas,alphanums) 

expr = Forward()
atom = ( integer | 
         ( lpar + expr.suppress() + rpar )
       )
expr << ( atom + (addop | multop) + atom )

statement = ident + assign + expr


L = statement..parseString( line )

Is there an example for visiting leafs of AST in L? Or something similar to that...

Thanks in advance


Solution

  • Your current parser will just give you a flat list of parsed tokens, since that is the default in pyparsing. The purpose is so that, regardless of how you build up your parser, whether in smaller pieces and then put them all together, or just in one giant statement, the tokens you get from parsing are structured (or not structured) the same. To get something akin to an AST, you need to define where you want structure using pyparsing's Group class (and I recommend using results names as well). So for example if you change statement to:

    statement = Group(ident("lhs") + '=' + Group(expr)("rhs"))
    

    Then your output will be much more predictable - every parsed statement will have 3 top-level elements - the target identifier (addressable as result.lhs), the '=' operator, and the source expression (addressable as result.rhs). The source expression may have further structure to it, but overall there will always be these 3 at the top-most level in every statement.

    To ensure the parenthetical groups in your RHS expression are retained when evaluating your expr, again, use a Group:

    atom = (integer | Group(lpar + expr + rpar))
    

    You can navigate the hierarchical structure of the parsed results as if you were walking a list of nested lists.

    But I would also encourage you to look at the SimpleBool example on the pyparsing wiki. In this example, the various parsed expressions get rendered into instances of classes which then can be processed using a visitor or just an iterator, and each class can then implement its own special logic for emitting your bytecode. Imagine that you had written a typical parser to generate an AST, then walked the AST to create CodeGenerator objects which subclass into AssignmentCodeGenerator or IfCodeGenerator or PrintCodeGenerator classes, and then walked this structure to create your bytecode. Instead, you can define assignment, if-then-else, or print statement expressions in pyparsing, have pyparsing create the classes directly, and then walk the classes to create the bytecode. In the end, your code is neatly organized into different statement types, and each type encapsulates the type of bytecode that it should output.