python parsing yacc lex ply

Parsing command strings defined on lines with PLY


I'm new to the world of lexing and parsing so I hope this is an easy problem to solve. I'm trying to parse a file with groups of tokens that fall into different types with Python's PLY:

STRING STRING QUANTITY STRING STRING       # TypeA
STRING STRING STRING STRING STRING STRING  # TypeB
STRING STRING QUANTITY QUANTITY QUANTITY   # TypeC

Each line is supposed to be one type of command that my program understands. For example, let's call the type defined in the top line TypeA, the second line TypeB, and so on. Since there's supposed to be one command per line, the NEWLINE token at the end of each line indicates the end of a command. I successfully managed to tokenize the file with the following lexer:

# top level tokens
tokens = [
    'QUANTITY',
    'STRING',
    'NEWLINE'
]

# number, possibly in exponential notation, e.g. -1.5e-3.0, or with an SI suffix, e.g. 'k'
t_QUANTITY = r'[+-]?(\d+\.\d*|\d*\.\d+|\d+)([eE][+-]?\d*\.?\d*|[GMkmunpf])?'

# a letter or underscore, followed by any number of letters, digits or underscores
t_STRING = r'[a-zA-Z_][a-zA-Z_0-9]*'

# ignore spaces and tabs
t_ignore = ' \t'

# ignore comments
t_ignore_COMMENT = r'\#.*'

# detect new lines
def t_newline(t):
    r'\n+'
    # generate newline token
    t.type = "NEWLINE"

    return t
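
As a quick sanity check of the two token patterns, here is a minimal sketch that uses only the standard `re` module. It is not how PLY scans input (PLY matches at the current position of the text rather than splitting on whitespace), and the helper names `classify` and `token_types` are made up for this illustration.

```python
import re

# The two token patterns from the lexer above, copied verbatim.
QUANTITY = r'[+-]?(\d+\.\d*|\d*\.\d+|\d+)([eE][+-]?\d*\.?\d*|[GMkmunpf])?'
STRING = r'[a-zA-Z_][a-zA-Z_0-9]*'


def classify(lexeme):
    """Return the name of the pattern matching the whole lexeme, or None."""
    if re.fullmatch(QUANTITY, lexeme):
        return 'QUANTITY'
    if re.fullmatch(STRING, lexeme):
        return 'STRING'
    return None


def token_types(line):
    """Drop a trailing comment, split on whitespace, classify each piece."""
    code = line.split('#', 1)[0]
    return [classify(piece) for piece in code.split()]


print(token_types("a b 5.5 c d     # TypeA"))
# ['STRING', 'STRING', 'QUANTITY', 'STRING', 'STRING']
print(token_types("o p -1 2.0 3e4  # TypeC"))
# ['STRING', 'STRING', 'QUANTITY', 'QUANTITY', 'QUANTITY']
```

Note that a single letter such as `k` is classified as a STRING, while `1k` is a QUANTITY with an SI suffix.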

I want to write a parser which will parse each matched command into different objects. I should end up with a list of the parsed objects.

I tried constructing the following rules:

def p_command(self, p):
    '''command : tokens NEWLINE
               | NEWLINE'''
    print("found command:", list(p))

def p_tokens(self, p):
    '''tokens : type_a_tokens
              | type_b_tokens
              | type_c_tokens'''
    p[0] = p[1]

def p_type_a_tokens(self, p):
    '''type_a_tokens : STRING STRING QUANTITY STRING STRING'''
    p[0] = "TypeA"

def p_type_b_tokens(self, p):
    '''type_b_tokens : STRING STRING STRING STRING STRING STRING'''
    p[0] = "TypeB"

def p_type_c_tokens(self, p):
    '''type_c_tokens : STRING STRING QUANTITY QUANTITY QUANTITY'''
    p[0] = "TypeC"

I get a syntax error for the token immediately after the first NEWLINE. Somehow the parser doesn't know to begin parsing a new command after it has matched the pattern of p_type_a_tokens and its NEWLINE.

Please can anyone shed some light on what should be a pretty simple set of parsing rules? Although the documentation for PLY is generally very good, all of the examples I've found so far are for calculators or programming languages where things like newlines don't apply.

Full source:

from ply import lex, yacc

class InputParser(object):
    # top level tokens
    tokens = [
        'QUANTITY',
        'STRING',
        'NEWLINE'
    ]

    t_QUANTITY = r'[+-]?(\d+\.\d*|\d*\.\d+|\d+)([eE][+-]?\d*\.?\d*|[GMkmunpf])?'
    t_STRING = r'[a-zA-Z_][a-zA-Z_0-9]*'

    # ignored characters
    t_ignore = ' \t'

    # ignore comments
    t_ignore_COMMENT = r'\#.*'

    def __init__(self, **kwargs):
        self.lexer = lex.lex(module=self, **kwargs)
        self.parser = yacc.yacc(module=self, **kwargs)

    # detect new lines
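    def t_newline(self, t):
        r'\n+'
        # keep the line count in sync for error messages
        t.lexer.lineno += len(t.value)
        # generate newline token; it must be returned, otherwise it is discarded
        t.type = "NEWLINE"
        return t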

    # error handling
    def t_error(self, t):
        # anything that gets past the other filters
        print("Illegal character '%s' on line %i at position %i" %
              (t.value[0], t.lexer.lineno, t.lexpos))

        # skip forward a character
        t.lexer.skip(1)

    # match commands on their own lines
    def p_command(self, p):
        '''command : tokens NEWLINE
                   | NEWLINE'''
        print("found command:", list(p))
        p[0] = p[1]

    def p_tokens(self, p):
        '''tokens : type_a_tokens
                  | type_b_tokens
                  | type_c_tokens'''
        p[0] = p[1]

    def p_type_a_tokens(self, p):
        '''type_a_tokens : STRING STRING QUANTITY STRING STRING'''
        print("found type a")
        p[0] = "TypeA"

    def p_type_b_tokens(self, p):
        '''type_b_tokens : STRING STRING STRING STRING STRING STRING'''
        print("found type b")
        p[0] = "TypeB"

    def p_type_c_tokens(self, p):
        '''type_c_tokens : STRING STRING QUANTITY QUANTITY QUANTITY'''
        print("found type c")
        p[0] = "TypeC"

    def p_error(self, p):
        if p:
            error_msg = "syntax error '%s'" % p.value
        else:
            error_msg = "syntax error at end of file"

        print(error_msg)

    def parse(self, text):
        self.parser.parse(text, lexer=self.lexer)

if __name__ == "__main__":
    parser = InputParser()
    parser.parse("""
a b 5.5 c d     # TypeA
e f 1.6 g h     # TypeA
i j k l m n     # TypeB
# empty line
o p -1 2.0 3e4  # TypeC
""")

Solution

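  • The problem is that the first rule is special: it is the start symbol, so the whole input has to reduce to it. Here the first rule is command, which matches exactly one line. It cannot combine two commands found on adjacent lines, so the parser fails as soon as a second command follows the first NEWLINE.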

    I fixed it by adding a new root rule, above p_command, which matches either a single command (for when the file contains only one command) or an existing list of commands followed by one more command (command_list):

    def p_command_list(self, p):
        '''command_list : command
                        | command_list command'''
        if len(p) == 3:
            self.commands.append(p[2])
        else:
            self.commands.append(p[1])
    

    (I also added a commands field to the class to hold the parsed commands.)

    With this rule the parser accepts any number of commands in sequence, which is how they appear in my input file.
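
Putting the pieces together, the following is a minimal sketch of the whole approach rather than the original poster's exact code. It assumes the `ply` package is installed. The class name `CommandParser` is hypothetical, and three details are choices made for this sketch: the list is built through `p[0]` instead of a `commands` field, blank lines are dropped by returning `None`, and the start symbol is passed explicitly through the `start` argument of `yacc.yacc` instead of relying on rule order.

```python
from ply import lex, yacc


class CommandParser(object):
    """Hypothetical, trimmed-down variant of the InputParser class above."""

    tokens = ['QUANTITY', 'STRING', 'NEWLINE']

    t_QUANTITY = r'[+-]?(\d+\.\d*|\d*\.\d+|\d+)([eE][+-]?\d*\.?\d*|[GMkmunpf])?'
    t_STRING = r'[a-zA-Z_][a-zA-Z_0-9]*'
    t_ignore = ' \t'
    t_ignore_COMMENT = r'\#.*'

    def __init__(self):
        self.lexer = lex.lex(module=self)
        # Name the start symbol explicitly instead of relying on rule order.
        self.parser = yacc.yacc(module=self, start='command_list', debug=False)

    def t_NEWLINE(self, t):
        r'\n+'
        t.lexer.lineno += len(t.value)
        return t  # a token rule that returns nothing discards its match

    def t_error(self, t):
        print("Illegal character %r on line %d" % (t.value[0], t.lexer.lineno))
        t.lexer.skip(1)

    # Root rule: one or more commands, collected into a Python list.
    def p_command_list(self, p):
        '''command_list : command
                        | command_list command'''
        if len(p) == 2:
            p[0] = [] if p[1] is None else [p[1]]
        else:
            p[0] = p[1] if p[2] is None else p[1] + [p[2]]

    def p_command(self, p):
        '''command : tokens NEWLINE
                   | NEWLINE'''
        # A blank line yields None so that it is left out of the result.
        p[0] = p[1] if len(p) == 3 else None

    def p_tokens(self, p):
        '''tokens : type_a_tokens
                  | type_b_tokens
                  | type_c_tokens'''
        p[0] = p[1]

    def p_type_a_tokens(self, p):
        '''type_a_tokens : STRING STRING QUANTITY STRING STRING'''
        p[0] = "TypeA"

    def p_type_b_tokens(self, p):
        '''type_b_tokens : STRING STRING STRING STRING STRING STRING'''
        p[0] = "TypeB"

    def p_type_c_tokens(self, p):
        '''type_c_tokens : STRING STRING QUANTITY QUANTITY QUANTITY'''
        p[0] = "TypeC"

    def p_error(self, p):
        print("syntax error %r" % p.value if p else "syntax error at end of file")

    def parse(self, text):
        return self.parser.parse(text, lexer=self.lexer)


print(CommandParser().parse("""
a b 5.5 c d     # TypeA
e f 1.6 g h     # TypeA
i j k l m n     # TypeB
# empty line
o p -1 2.0 3e4  # TypeC
"""))
# ['TypeA', 'TypeA', 'TypeB', 'TypeC']
```

Because every command must end in NEWLINE under this grammar, input whose last line lacks a trailing newline is reported as a syntax error at end of file.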