
Why does the grammar I defined not use tokens?


I'm working on defining a new language with lex and yacc. The lexer works fine but the parser doesn't. I thought the problem was that the grammar wasn't recognizing tokens, but after a lot of research and trials I'm stuck. I'm not sure the grammar is completely right. I get "Syntax error in input!" from the parser, but there is no syntax error in the data.


The data input is:

F 100

And here is another example for this grammar:

L 36 [L 4 [F 100 R 90] R 10]

My lexer (lexing.py) code:

import lex

tokens = (
    'NUMBER',
    'RED',
    'GREEN',
    'BLUE',
    'BLACK',
    'FORW',
    'RIGHT',
    'LOOP',
    'COLOR',
    'PEN',
    'LSQB',
    'RSQB',
    'EMPTY'
) 

t_FORW    = r'F'
t_RIGHT   = r'R'
t_LOOP   = r'L'
t_COLOR  = r'COLOR'
t_PEN  = r'PEN'
t_LSQB  = r'\['
t_RSQB  = r'\]'
t_RED  = r'K'
t_GREEN  = r'Y'
t_BLUE  = r'M'
t_BLACK  = r'S'
t_EMPTY = r'\ ' 

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)    
    return t 

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value) 

t_ignore  = ' \t' 

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1) 

lexer = lex.lex()


data = '''
F 100
''' 

lexer.input(data)
 
for tok in lexer:
    print(tok)

And here is parsing code:

import yacc
from lexing import tokens


def p_root(p):
    '''root : function NUMBER option
            | COLOR colors option
            | PEN NUMBER option '''
    
def p_option(p):
    '''option : root
              | LSQB root RSQB root
              | EMPTY '''
def p_function(p):
    '''function : FORW 
                | RIGHT 
                | LOOP '''
    
def p_colors(p):
    '''colors : RED 
              | BLUE 
              | GREEN 
              | BLACK ''' 
              
def p_error(p):
    print("Syntax error in input!")

from lexing import data

# Build the parser

parser=yacc.yacc()
result=parser.parse(data)
#print (result)

I tried every way I know. As you can see, I haven't written the bodies of the p functions yet, but writing them isn't the solution. What could be the real problem?


Solution

  • The immediate problem is your use of EMPTY as a token (a single space character). That definition conflicts with your list of ignored characters in t_ignore, which is also going to be a problem. But note that there is no space character at the end of your input (the input ends with a newline, which is ignored), and your grammar requires option to end with EMPTY. That's guaranteed to produce a syntax error. (In the first version of this answer, I said that t_ignore is overridden by explicit token patterns, but it turns out that I was wrong. You can use an ignored character inside a rule, but not at the beginning; tokens which start with an ignored character will never be matched.)
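
    A minimal sketch of the fix, keeping the rest of the grammar as-is: delete `t_EMPTY` from the lexer and let `option` derive nothing through PLY's usual empty-production idiom instead.

    ```python
    # Sketch of the fix (PLY): drop t_EMPTY from the lexer entirely,
    # and let option derive the empty string via an explicit empty
    # production instead of a space token.
    def p_option(p):
        '''option : root
                  | LSQB root RSQB root
                  | empty'''

    def p_empty(p):
        'empty :'
        pass
    ```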

    Particularly if this is your first project, you should follow a more systematic debugging technique. First ensure that the input is tokenised in the way you expect it to be, without trying to parse the token stream. When you do start parsing the stream, make sure that any syntax errors report the token which produced the error, including its location in the input.
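
    For example, the stock `p_error` above can be replaced with one that reports which token triggered the error and where. This is a sketch; PLY hands `p_error` the offending `LexToken`, or `None` when the error is at end of input.

    ```python
    def p_error(p):
        # p is the offending LexToken, or None at end of input.
        if p is None:
            print("Syntax error: unexpected end of input")
        else:
            print("Syntax error at line %d: unexpected %s %r"
                  % (p.lineno, p.type, p.value))
    ```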

    Both the lexer and the parser can be called with the keyword argument debug=True, which will provide a debugging log of parser actions. That should definitely be part of your debugging.

    All that said, your grammar strikes me as not very useful for determining the structure of the input. A good grammar reads the same way as you would describe the input in your own language, which might include descriptions like:

    • The input is a list of commands.
    • A command could be a forward command, a right command, ..., or a loop command.
    • A forward command is an F followed by a number.
    • A loop command is an L followed by a number followed by a list of commands inside [ and ].
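
    Those sentences translate almost directly into rules. As a sketch (rule names are illustrative, and the COLOR/PEN commands would slot into `command` the same way):

    ```python
    # Sketch only: a grammar that mirrors the English description above.
    # Left recursion in p_program is deliberate: LALR parsers like yacc
    # handle it with constant stack space.
    def p_program(p):
        '''program : program command
                   | command'''

    def p_command(p):
        '''command : FORW NUMBER
                   | RIGHT NUMBER
                   | LOOP NUMBER LSQB program RSQB'''
    ```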