Tokenizing a letter as an operator


I need to make a language that has variables in it, but it also needs the letter 'd' to be an operator that has a number on the right and maybe a number on the left. I thought that defining the lexer rule for the letter first would give it precedence, but that doesn't happen and I don't know why.

from ply import lex, yacc

tokens=['INT', 'D', 'PLUS', 'MINUS', 'LPAR', 'RPAR', 'BIGGEST', 'SMALLEST', 'EQ', 'NAME']

t_PLUS     = r'\+'
t_MINUS    = r'\-'
t_LPAR     = r'\('
t_RPAR     = r'\)'
t_BIGGEST  = r'\!'
t_SMALLEST = r'\#'
t_D        = r'[dD]'
t_EQ       = r'\='
t_NAME     = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_INT(t):
    r'[0-9]\d*'
    t.value = int(t.value)
    return t


def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)  # r'\n+' can match several newlines at once


t_ignore = ' \t'

def t_error(t):
    print("Not recognized by the lexer:", t.value)
    t.lexer.skip(1)

lexer = lex.lex()

while True:
    try: s = input(">> ")
    except EOFError: break
    lexer.input(s)
    while True:
        t = lexer.token()
        if not t: break
        print(t)

If I enter 3d4, it outputs:

LexToken(INT,3,1,0)
LexToken(NAME,'d4',1,1)

and I don't know how to work around it.


Solution

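  • Ply does not prioritize token variables by order of appearance. Rules defined as functions are tried first, in the order they are defined; rules defined as strings are then tried in decreasing order of regular expression length (longest pattern first). Your t_NAME pattern is longer than the t_D pattern, so it is tried before t_D and matches d4 as a name. This is explained in the Ply manual, along with a concrete example of how to handle reserved words (which may not apply in your case).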

    If I understand correctly, the letter d cannot be an identifier, and neither can d followed by a number. It is not entirely clear to me whether you expect d2e to be a plausible identifier, but for simplicity I'm assuming that the answer is "No", in which case you can easily restrict the t_NAME regular expression by requiring an initial d to be followed by another letter (or an underscore):

    t_NAME = r'([a-ce-zA-CE-Z_]|[dD][a-zA-Z_])[a-zA-Z0-9_]*'
    

    If you wanted to allow d2e to be a name, then you could go with:

    t_NAME = r'([a-ce-zA-CE-Z_]|[dD][0-9]*[a-zA-Z_])[a-zA-Z0-9_]*'
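
To see why the ordering matters, here is a rough sketch that imitates the rule ordering described above using only the standard re module. It is not Ply's actual code: build_master and tokenize are hypothetical helpers written for illustration, INT values are left as strings, and whitespace handling is omitted. It assumes the function rule (INT) is tried first and the string rules are sorted by decreasing pattern length.

```python
import re

# INT is a function rule in the question, so it is tried before string rules.
INT_RULE = ("INT", r"[0-9]\d*")

# String rules from the question (only the two that matter here).
ORIGINAL_RULES = {
    "D": r"[dD]",
    "NAME": r"[a-zA-Z_][a-zA-Z0-9_]*",
}

# Same rules, with the restricted NAME pattern from the answer.
FIXED_RULES = {
    "D": r"[dD]",
    "NAME": r"([a-ce-zA-CE-Z_]|[dD][a-zA-Z_])[a-zA-Z0-9_]*",
}


def build_master(string_rules):
    """Join all rules into one alternation: function rule first, then
    string rules sorted by decreasing regex length (illustrative only)."""
    ordered = sorted(string_rules.items(), key=lambda kv: len(kv[1]), reverse=True)
    rules = [INT_RULE] + ordered
    return re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in rules))


def tokenize(text, string_rules):
    """Return (token_type, lexeme) pairs; raise on unmatched input."""
    master = build_master(string_rules)
    pos, out = 0, []
    while pos < len(text):
        m = master.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at position {pos}: {text[pos:]!r}")
        # lastgroup is the named group of the alternative that matched.
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out


print(tokenize("3d4", ORIGINAL_RULES))  # [('INT', '3'), ('NAME', 'd4')]
print(tokenize("3d4", FIXED_RULES))     # [('INT', '3'), ('D', 'd'), ('INT', '4')]
print(tokenize("dog", FIXED_RULES))     # [('NAME', 'dog')]
```

With the original rules, NAME is the longer pattern, so it is tried before D and swallows d4. With the restricted pattern, NAME cannot match at d4, so the alternation falls through to D, while ordinary identifiers that start with d, such as dog, are still recognized as names.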