
Does ply.lex parse the same token once?


I was reading a document on lexical analysis so that I could parse some arguments, and I followed it exactly to create a lexer. This is the whole code:

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import ply.lex as lex

args = ['[watashi]', '[anata]>500', '[kare]>400&&[kare]<800']

tokens = ('NUMBER', 'EXPRESSION', 'AND', 'LESS', 'MORE')

t_EXPRESSION = r'\[.*\]'
t_AND = r'&&'
t_LESS = r'<'
t_MORE = r'>'
t_ignore = '\t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print('Illegal character "%s"' % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()

for i in args:
    lexer.input(i)
    while True:
        tok = lexer.token()
        if not tok: break
        print(tok)
    print('#############')

I simply created a list of sample arguments and I got this output:

LexToken(EXPRESSION,'[watashi]',1,0)
#############
LexToken(EXPRESSION,'[anata]',1,0)
LexToken(MORE,'>',1,7)
LexToken(NUMBER,500,1,8)
#############
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
#############

The first and second sample arguments are tokenized correctly, but the third is not. The third sample argument comes out as EXPRESSION+LESS+NUMBER, whereas it should be EXPRESSION+MORE+NUMBER+AND+EXPRESSION+LESS+NUMBER. So I thought it could be one of these problems:

  • ply.lex parses each token type only once: In the code above, ply.lex cannot recognize two separate expressions and reports everything up to the last one as a single token. "[kare]>400&&[kare]" is EXPRESSION because it ends with the last EXPRESSION token, which is the second [kare], and 800 is NUMBER because it is the last NUMBER token.

    !!! OR !!!

  • There is a mistake in the t_EXPRESSION variable: I defined it as r'\[.*\]' to capture all characters between the two brackets ([]). The first token of the third sample argument is "[kare]>400&&[kare]" because it starts and ends with brackets and .* (any characters) matches everything in between, but I thought the lexer would stop at the first ] character it encountered.

I could not find a way to solve this, so I am asking here.

In general, this is what I am struggling with:

lexer.input("[kare]>400&&[kare]<800")
while True:
    tok = lexer.token()
    if not tok: break
    print(tok)

I get

LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)

but I expected something more like

LexToken(EXPRESSION,'[kare]',1,0)
LexToken(MORE,'>',?)
LexToken(NUMBER,400,?)
LexToken(AND,'&&',?)
LexToken(EXPRESSION,'[kare]',1,12)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)

Solution

  • I think I see your problem:

    t_EXPRESSION = r'\[.*\]'
    

    is greedy and will match the longest string it can, i.e. '[kare]>400&&[kare]'.

    Instead, try

    t_EXPRESSION = r'\[[^\]]*\]'
    

    This matches only one bracketed group, because it looks for any character that is not a closing bracket ([^\]]) instead of any character at all (.).

    You can also use non-greedy matching:

    t_EXPRESSION = r'\[.*?\]'
    

    The ? makes .* match as few characters as possible rather than as many as possible.
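
To see the difference without involving ply at all, here is a minimal sketch that runs the three patterns through Python's standard re module on the third sample argument. The TOKEN_RE pattern and the token_types helper are hypothetical names made up for this illustration; they only approximate the token rules from the question with a single combined regex and are not how ply.lex builds its lexer internally.

```python
import re

text = '[kare]>400&&[kare]<800'

# Greedy: .* runs on to the LAST ']' in the input.
greedy = re.match(r'\[.*\]', text).group(0)

# Negated character class: cannot cross a ']', so it stops at the first one.
negated = re.match(r'\[[^\]]*\]', text).group(0)

# Non-greedy: .*? grows only as far as needed to reach a ']'.
lazy = re.match(r'\[.*?\]', text).group(0)

print(greedy)   # [kare]>400&&[kare]
print(negated)  # [kare]
print(lazy)     # [kare]

# Hypothetical stand-in for the lexer: one alternation of named groups,
# using the corrected EXPRESSION pattern.
TOKEN_RE = re.compile(
    r'(?P<EXPRESSION>\[[^\]]*\])'
    r'|(?P<AND>&&)'
    r'|(?P<LESS><)'
    r'|(?P<MORE>>)'
    r'|(?P<NUMBER>\d+)'
)

def token_types(s):
    # lastgroup is the name of the alternative that matched.
    return [m.lastgroup for m in TOKEN_RE.finditer(s)]

print(token_types(text))
```

With the corrected pattern, the type sequence comes out as EXPRESSION, MORE, NUMBER, AND, EXPRESSION, LESS, NUMBER, which is what the question expected. So the lexer is not limited to one token per type; the greedy pattern was simply swallowing the text in between.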