I am very new to PLY and a bit more than a beginner in Python. I am playing around with PLY-3.4 and Python 2.7 to learn it. Please see the code below. I am trying to create a token QTAG, which is a string made of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', a positive integer, and one or more whitespace characters. For example, VALID QTAGs are
"Q.11 "
" Q.12 "
"q.13 "
'''
Q.14
'''
INVALID ones are
"asdf Q.15 "
"Q. 15 "
Here is my code:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required
    tokens = [
        'QTAG',
        'INT'
    ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test(" Q.12 ")
q.test("q.13 ")
q.test('''
Q.14
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q. 15 ")
The output I get is as follows:
LexToken(QTAG,11,1,0)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,12,1,4)
LexToken(QTAG,13,1,0)
Newline found
Illegal character 'Q'
Illegal character '.'
LexToken(INT,14,2,6)
Newline found
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,7)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,4)
Notice that only the first and third of the valid inputs are correctly tokenized. I cannot figure out why my other valid inputs are not being tokenized properly. In the docstring for t_QTAG, replacing '^' with '\A' did not work. I also tried removing '^'. Then all the valid inputs get tokenized, but then the second invalid input also gets tokenized. Any help is appreciated in advance!
Thanks
PS: I joined the google-group ply-hack and tried posting there, but I could not post either directly in the forum or through email. I am not sure if the group is active anymore. Prof. Beazley is not responding either. Any ideas?
Finally I found the answer myself. Posting it so that others may find it useful.
As @Tadgh rightly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so the leading whitespace in the t_QTAG regex above never gets a chance to match. By the time t_QTAG is tried, the lexer is no longer at the start of the string, so '^' cannot match, and the consequence is that the second valid input is not tokenized. By reading the PLY documentation carefully, I learned that if the ordering of the token regexes is to be maintained, they have to be defined as functions rather than as strings, as was done for t_ignore. If strings are used, PLY automatically sorts them by decreasing regex length and adds them after the functions. t_ignore is special here, I guess, in that it is somehow applied before anything else. This part is not clearly documented. The workaround is to define a function for a new token, e.g. t_SPACETAB, after t_QTAG, and simply not return anything from it. With this, all the valid inputs are correctly tokenized, except the one with triple quotes (the multi-line string containing "Q.14"). Also, the invalid ones are, as per the specification, not tokenized.
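To see why the leading space matters, here is a rough sketch of the effect using only the re module. It is a simplified model of a lexer's matching loop, not PLY's actual code, and the helper first_qtag_attempt is a name I made up for illustration. It assumes the lexer first skips the ignored characters and then tries the rule at the current position:

```python
import re

# Simplified model of a PLY-style matching loop; NOT PLY's actual code.
QTAG = re.compile(r'^[ \t]*[Qq]\.[0-9]+\s+')

def first_qtag_attempt(data, ignore):
    """Skip leading characters found in `ignore`, then try QTAG there.

    Returns (position, matched_text_or_None). Hypothetical helper.
    """
    pos = 0
    while pos < len(data) and data[pos] in ignore:
        pos += 1
    # Without (?m), '^' matches only at the real start of `data`,
    # not at the index where matching begins.
    m = QTAG.match(data, pos)
    return pos, (m.group(0) if m else None)

print(first_qtag_attempt(" Q.12 ", " \t"))  # (1, None)
print(first_qtag_attempt(" Q.12 ", ""))     # (0, ' Q.12 ')
print(first_qtag_attempt("Q.11 ", " \t"))   # (0, 'Q.11 ')
```

With the space in the ignore set, the match is attempted at position 1 and fails. With an empty ignore set, the regex itself gets to consume the leading space.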
Multi-line string problem: It turns out that PLY uses the re module internally. In that module, by default, ^ matches only at the beginning of the string and NOT at the beginning of every line. To change that behavior, I need to turn on the multi-line flag, which can be done within the regex using (?m). So, to process all the valid and invalid strings in my test properly, the correct regex is:
r'(?m)^\s*[Qq]\.[0-9]+\s+'
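The difference the flag makes can be checked with the re module alone. This is a small sketch outside PLY. The sample text and the variable names are my own, chosen only for illustration:

```python
import re

text = "qewr\nQ.15 asda\n"
line_start = text.index("Q")  # index of the first character after the newline

default_re = re.compile(r'^\s*[Qq]\.[0-9]+\s+')
multiline_re = re.compile(r'(?m)^\s*[Qq]\.[0-9]+\s+')

# Without the flag, '^' matches only at index 0 of the whole string.
print(default_re.match(text, line_start))                        # None
# With (?m), '^' also matches right after each newline.
print(multiline_re.match(text, line_start).group(0) == "Q.15 ")  # True
```

Passing re.MULTILINE to re.compile has the same effect as (?m), but the inline form is the one that fits inside a PLY docstring regex.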
Here is the corrected code with some more tests added:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required
    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
    ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test(" Q.12 ")
q.test("q.13 ")
q.test("""
Q.14
""")
q.test("""
qewr
dhdhg
dfhg
Q.15 asda
""")
# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q. 17 ")
Here is the output:
-============Testing some VALID inputs===========-
LexToken(QTAG,11,1,0)
LexToken(QTAG,12,1,0)
LexToken(QTAG,13,1,0)
LexToken(QTAG,14,1,0)
Newline found
Illegal character 'q'
Illegal character 'e'
Illegal character 'w'
Illegal character 'r'
Newline found
Illegal character 'd'
Illegal character 'h'
Illegal character 'd'
Illegal character 'h'
Illegal character 'g'
Newline found
Illegal character 'd'
Illegal character 'f'
Illegal character 'h'
Illegal character 'g'
Newline found
LexToken(QTAG,15,6,18)
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'a'
Newline found
-============Testing some INVALID inputs===========-
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,16,8,7)
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
Space(s) and/or tab(s)
LexToken(INT,17,8,4)
Space(s) and/or tab(s)