Search code examples
pythonparsingplyanytree

Why is my PLY parser not grouping anytree nodes together properly?


I am trying to build a CAS in Python and am currently stuck on implementing the parser I will use to go from expression string to an Anytree tree I can eventually manipulate and simplify. The issue seems to be that, when parsing, yacc doesn't implement my GROUP node with it's proper children nodes as defined in my parser grammar spec. I've tried messing with precedences and associativities, switching up the order of the grammar rules but nothing seems to make it parent the nodes properly. What's even stranger is that in debug/verbose mode, it makes a node for the expression when it pattern-matches to it, but it (for some reason) fails to parent it to the GROUP node when it recognises an LPAREN expression RPAREN token

Here's my code:

import ply.yacc as yacc
from anytree import Node, RenderTree
import ply.lex as lex

#init token names
tokens = (
    'INTEGER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

#init regex rules for said tokens
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

#init regex rule/function for integers
def t_INTEGER(t):
    r'\d+'
    t.value = int(t.value)
    return t

#ignoring whitespace
t_ignore = '\t'

#handling unknown characters (%s inserts whatever is past the second %)
def t_error(t):
    print("Illegal character '%s'" %t.value[0] )
    t.lexer.skip(1)

#building the lexer
lexer = lex.lex()


#parser grammar spec

#binary operators
def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression '''
    p[0] = Node(p[2],children = [Node(p[1]),Node(p[3])])

#brackets/grouping
def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = Node(p[1], children = [Node(p[2])])

#integers
def p_expression_number(p):
    'expression : INTEGER'
    p[0] = Node(p[1])


def p_error(p):
    print("Input syntax error ")

treeParser = yacc.yacc()

while True:
    try:
        s = input("Calculate this >")
    except EOFError:
        break
    if not s: break
    ParsTree = treeParser.parse(s)
    print(RenderTree(ParsTree))

Sample input: (2)+(2)

Sample Output:

Calculate this >(2)+(2)
Node('/+')
├── Node("/+/Node('/GROUP')")
└── Node("/+/Node('/GROUP')")

As you can see, it only creates GROUP nodes and does not make any child integer nodes under said GROUP nodes

Edit: Made the code self-contained and adding sample input and output to better explain the problem


Solution

  • The first argument to Node is expected to be a string. (In fact, it's expected to be a string which labels the node, but I'll get back to that in a minute.) If it's not a string, it's converted to one.

    So consider your action:

    p[0] = Node(p[2],children = [Node(p[1]),Node(p[3])])
    

    Here, p[2] is a token, whose value is a string (+, in the example). But p[1] and p[3] are both expression, and the value of an expression is a Node, not a string. So anytree converts it to a string, such as 'Node("\GROUP")'.

    Actually, when I tried the code you pasted, the result was 'Node("/(")', because the action for p_expression_group is:

    p[0] = Node(p[1], children = [Node(p[2])])
    

    I guess that in the code you ran, unlike the code you pasted into your question, that rule read:

    p[0] = Node("GROUP", children = [Node(p[2])])
    

    Or perhaps you had a different lexical action for \(. As a side-note, please always verify that the same output you paste into a question is really the output of the code you paste, and not of some other version of the same code, even if that's very similar, in order to avoid confusions.

    So the solution is to not call Node on things which are already Nodes. I changed the two relevant actions as follows:

    #binary operators
    def p_expression_binop(p):
        '''expression : expression PLUS expression
                      | expression MINUS expression
                      | expression TIMES expression
                      | expression DIVIDE expression '''
        p[0] = Node(p[2],children = [p[1], p[3]])  # Removed Node calls
    
    #brackets/grouping
    def p_expression_group(p):
        'expression : LPAREN expression RPAREN'
        p[0] = Node(p[1], children = [p[2]])       # Removed Node call
    

    And then got the "expected" result:

    $ python3 parser.py<<<'(2)+(2)'
    Generating LALR tables
    WARNING: 16 shift/reduce conflicts
    Calculate this >Node('/2')
    Node('/2')
    Node('/+')
    ├── Node('/+/(')
    │   └── Node('/+/(/2')
    └── Node('/+/(')
        └── Node('/+/(/2')
    

    You can see here that anytree has its own idea of how to present the label of a Node: it produces a "path" of labels from the parent, grandparent, etc. of the node, using / as a path separator. That definitely obscures your use of the label, which is effectively a operator name or a value. There's no real need for an AST node to have a unique human-readable label, which seems to be the anytree model. If you want to continue using anytree, you should consider adding other attributes, perhaps a nodetype attribute and a value or operator attribute, so that you can use the label field as a label (in case you find labels convenient for something).

    On the other hand, you might in the long run find that anytree is not a good match for you parsing problem. (Tree with parent nodes suffer from a number of oddities. For example, you can only put a node into one place, so you can't share it between different trees, or put more than one instance in the same tree. All of these use cases will require an explicit node copy.)