Search code examples
python-3.xparsingply

Looping a Parsing Step in PLY Python 3


I am currently using python 3 to parse a programming language that I am making. I am wondering how I can make a parsing rule loop.

That may be unclear, so here is an example.

I have the code:

def c_parse():
    def p_program(p):
        "program : actions"
        p[0] = ("PROGRAM", p[1])

    def p_actions(p):
        """actions : action
                   | actions action"""
        if len(p) == 3:
            p[0] = ("ACTIONS", p[1], p[2])
        elif len(p) == 2:
            p[0] = ("ACTION", p[1])

    def p_action(p):
        """action : function_call ';'
                  | variable_dec ';'
                  | if_statement ';'"""
        p[0] = ("ACTION_STATEMENT", p[1])

    ...

Given the input:

x = "HELLO WORLD";
print(x);
print(x);

This outputs this AST:

('PROGRAM', ('ACTIONS', ('ACTIONS', ('ACTION', ('ACTION_STATEMENT', 
('VARIABLE_DEC', ... ))), 
('ACTION_STATEMENT', ('FUNCTION_CALL', ... ))), ('ACTION_STATEMENT', 
('FUNCTION_CALL', ... ))))

Notice how the ACTIONS and ACTION_STATEMENT are extremely messy. This happens because of the recursive rule defined in the p_actions() function. Is there a way where I could just get a nice and clean AST like this:

(PROGRAM, (ACTION, (VARIABLE_DEC, ... )), (ACTION, (FUNCTION_CALL, ... )),
(ACTION, (FUNCTION_CALL, ...)))

In other words, can I make the p_actions() function be able to parse an infinite amount of ACTION nodes without using recursion? Is this possible?

BTW I'm on macOS if it matters.


Solution

  • If you are using a context-free grammar, you cannot parse input of unlimited length without using recursion, because recursion is the way you express unlimited repetition in a context-free grammar. That affects the formal syntax tree but there is absolutely no reason why your abstract syntax tree (AST) needs to preserve every detail of the parse.

    An AST is called abstract precisely because it abstracts away some details of the grammar which are not very useful for semantic analysis. You are free to create an AST in any manner you like; there are no arbitrary rules. (There is a heuristic: the AST should preserve exactly those features of the parse tree which are useful to you, and no more.)

    It is particularly common to remove unit productions from ASTs. (A unit production is a production whose right-hand side consists only of a single non-terminal, such as actions: action) when they do not add any useful information. Sometimes, a production with only one non-terminal on the right-hand side will be removed from the AST even if it is not strictly speaking a unit rule. This will depend on whether the production has semantic significance. For example, expression: '(' expression ')' is likely to be omitted, although expression: '-' expression is not.

    In Ply terms, omitting a production consists of simply passing through the value from the right-hand side. For example, you might have:

    def unit_rule(p):
      """actions : action
         program : actions
      """
      p[0] = p[1]   # Pass through the non-terminal's value
    
    def parens(p):
      """expr : LPAREN expr RPAREN"""
      p[0] = p[2]   # Pass through the non-terminal's value
    

    You are also not limited to just creating syntax nodes which faithfully mimic the grammatical structure. If you want a list to be represented in the AST as a list, that is just fine. Python's append operation can be pretty useful for this.

    So one possible way of getting the AST you seem to want would be:

    start = 'program'
    
    def p_actions(p):
        """actions : actions action"""
        p[1].append(p[2])
        p[0] = p[1]       
    
    def p_actions1(p):
        """actions : action"""
        p[0] = ["ACTIONS", p[1]]
    
    def p_unit(p):
        """program : actions"
           action  : function_call ';'
                   | variable_dec ';'
                   | if_statement ';'
        """
        p[0] = p[1]
    

    Some notes about the above code:

    1. I didn't see the point of ACTION nodes, so I just passed the statements themselves through in the last rule. Since actions consists only of ACTIONs, marking each element in the list as an ACTION seems redundant. But you can put in the ACTION if you want to.)

    2. In the above code, an ACTIONS node is a list, not a tuple; the p_actions function deliberately does not create a new list each time a new item is added. This is normally just fine, and saves a bunch of tuple creation and garbage collection. It assumes that that the value being passed through is not being used anywhere else, which is certainly the case here but might not always be true. However, if you really want tuples, that's not a problem:

      def p_actions(p):
          """actions : actions action"""
          p[0] = p[1] + (p[2],)
      
    3. Note that it is not necessary to put all the productions for a non-terminal in the same parse function. Productions can be grouped into functions if their actions are the same. On the whole, if you find yourself trying to figure out which production caused the action function to be triggered (if len(p) == 3, for example), then you might want to think about dividing the productions between two different functions, each of which has a non-conditional action.