The syntax of many programming languages requires that their input be tokenized according to the "maximal munch" principle: each token is built from the longest possible sequence of characters remaining in the input stream.
PLY's lexer does not seem to apply this principle. For example:
import ply.lex as lex
tokens = ('ASSIGNMENT', 'EQUALITY')
t_ASSIGNMENT = r'[+\-*/]?='
t_EQUALITY = r'=='
lexer = lex.lex()
lexer.input('==')
for tok in lexer:
    print(tok)
According to "maximal munch", the output of this code should be LexToken(EQUALITY,'==',1,0), but it is LexToken(ASSIGNMENT,'=',1,0) followed by LexToken(ASSIGNMENT,'=',1,1). This seems to be because the lexer prefers ASSIGNMENT over EQUALITY, prioritizing the longer regular expression rather than the longer match.
Is it possible to get PLY's lexer to follow the "maximal munch" principle?
If not, are there any guidelines for how tokens should be specified to avoid "less-than-maximal munch" situations such as the one above?
PLY uses Python's own re package to match tokens, building a single master regular expression as a combination of alternatives. Since Python's regular expression library does not implement maximal munch, neither does PLY.
Instead, the match selected is the first pattern in this combined regular expression that matches, and the order of the patterns is documented in the PLY manual:
When building the master regular expression, rules are added in the following order:
1. All tokens defined by functions are added in the same order as they appear in the lexer file.
2. Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
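A simplified sketch of that first-match behaviour, using the two patterns from the question (PLY's actual master expression uses named groups, but the ordering effect is the same):

import re

# The first alternative that matches wins, regardless of which
# alternative would consume more of the input.
print(re.match(r'(?:[+\-*/]?=)|(?:==)', '==').group())  # prints '='
print(re.match(r'(?:==)|(?:[+\-*/]?=)', '==').group())  # prints '=='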
Since the pattern which matches = ([+\-*/]?=) is the longer regular expression, it is inserted earlier in the master expression, and == never gets a chance to match.
To fix it, define the token patterns as functions and order the function definitions so that the token you want tried first appears first.
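A minimal sketch of that fix for the two tokens in the question; function rules are added to the master expression in definition order, so placing t_EQUALITY first lets == win (the t_error handler is not in the original question, it just silences PLY's warning):

import ply.lex as lex

tokens = ('ASSIGNMENT', 'EQUALITY')

# Function rules keep their definition order, so EQUALITY is tried first.
def t_EQUALITY(t):
    r'=='
    return t

def t_ASSIGNMENT(t):
    r'[+\-*/]?='
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('==')
for tok in lexer:
    print(tok)  # LexToken(EQUALITY,'==',1,0)

Note that only the token that must be tried first has to become a function; any remaining string rules are appended afterwards with their usual length-based ordering.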