Tags: python, parsing, lexer, ply

Python parser ply does not handle spaces


I am parsing data using ply and trying to use a space as part of a lexeme. Here is a simplified example:

from ply.lex import lex
from ply.yacc import yacc

tokens = ('NUM', 'SPACE')

t_NUM = r'\d+'
t_SPACE = r' '

def t_error(t):
    print(f'Illegal character {t.value[0]!r}')
    t.lexer.skip(1)

lexer = lex()

def p_two(p):
    '''
    two : NUM SPACE NUM
    '''
    p[0] = ('two', p[1], p[2], p[3])

def p_error(p):
    if p:
        print(f"Syntax error at '{p.value}'")
    else:
        print("Syntax error at EOF")

parser = yacc()

ast = parser.parse('1 2')
print(ast)

When I run it, I get this error:

ERROR: Regular expression for rule 't_SPACE' matches empty string
Traceback (most recent call last):
  File "c:\demo\simple_space.py", line 19, in <module>
    lexer = lex()
  File "C:\demo\3rdparty\ply\ply\lex.py", line 752, in lex
    raise SyntaxError("Can't build lexer")
SyntaxError: Can't build lexer

Is it possible to specify a space as part of a lexeme? A few additional tokens I would like to define:

  • t_COMMENT = r' \#.*' for a comment
  • t_DIVIDE = r': +' for a divider

Solution

  • This is explained in the Ply manual section on Specification of tokens:

    Internally, lex.py uses the re module to do its pattern matching. Patterns are compiled using the re.VERBOSE flag which can be used to help readability. However, be aware that unescaped whitespace is ignored and comments are allowed in this mode. If your pattern involves whitespace, make sure you use \s. If you need to match the # character, use [#].

    So a literal space character must be written as [ ] or \ . (\s, as suggested in the manual, matches any whitespace, not just a space character.)
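The `re.VERBOSE` behavior, and the corrected patterns for the question's three tokens, can be checked with the standard `re` module alone (a sketch; only the regexes change, the rest of the lexer/parser stays as written in the question):

```python
import re

# Under re.VERBOSE, unescaped whitespace in a pattern is ignored, so r' '
# compiles to a pattern matching the empty string -- exactly what ply rejects.
assert re.compile(r' ', re.VERBOSE).match('12').group(0) == ''    # space ignored
# A character class keeps the space literal:
assert re.compile(r'[ ]', re.VERBOSE).match(' 12').group(0) == ' '

# Corrected patterns for the question's tokens:
t_SPACE   = r'[ ]'       # a single literal space
t_COMMENT = r'[ ][#].*'  # leading space and '#' both made explicit
t_DIVIDE  = r':[ ]+'     # with r': +' the space is dropped, leaving r':+',
                         # which would match one or more colons instead

assert re.compile(t_COMMENT, re.VERBOSE).match(' # a comment')
assert re.compile(t_DIVIDE,  re.VERBOSE).match(':   ')
```

With `t_SPACE = r'[ ]'` the lexer builds cleanly and `parser.parse('1 2')` returns `('two', '1', ' ', '2')`.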