Let's say I would like to have just one PLY token, 'INTEGER'. However, I would like to be able to parse typical C-style literals in different bases, so effectively I need to parse strings like 0b10 (or 0B10), 010, 10 and 0x10 (or 0X10). As I don't really care what the "input format" was, I would like to just have the value as an int in Python.
However, handling all four cases in a single function is not especially convenient. First of all, the regex becomes pretty long: r'0[0-7]+|0[bB][01]+|0[xX][0-9a-fA-F]+|[0-9]+'. But this is the smaller issue: the code of the function has to deal with several combinations to know which base to use. A string starting with 0 can in reality be just a single character, so checking the further cases (next character is x, X, b or B) also has to take the length into account.
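For illustration, the prefix and length checks described above might look roughly like this in plain Python. The name c_literal_to_int is hypothetical, and this is only a sketch of the single-function approach applied to an already-matched string, not PLY code.

```python
def c_literal_to_int(text):
    """Convert a C-style integer literal (already matched by the lexer) to an int."""
    prefix = text[:2].lower()  # for a one-character string this is just that character
    if prefix == '0x':
        return int(text[2:], 16)
    if prefix == '0b':
        return int(text[2:], 2)
    # A lone '0' is decimal; only longer strings with a leading 0 are octal.
    if len(text) > 1 and text[0] == '0':
        return int(text[1:], 8)  # raises ValueError for input such as '08'
    return int(text, 10)


print(c_literal_to_int('0x10'))  # prints 16
```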
So I would prefer to have four separate functions, all returning the same 'INTEGER' type of token. I would prefer not to introduce BINARY_INTEGER, OCTAL_INTEGER, DECIMAL_INTEGER and HEXADECIMAL_INTEGER, because this would needlessly complicate the parser (or maybe I'm overthinking that?).
I was wondering whether there's something smarter to do than just forcing token.type to be 'INTEGER' in four "free" functions? Something other than:
def t_BINARY_LITERAL(t):
    r'0[bB][01]+'
    t.value = int(t.value[2:], 2)
    t.type = 'INTEGER'
    return t

# Defined before the decimal rule: PLY tries function rules in definition
# order, and r'[0-9]+' would otherwise match just the leading 0 of 0x10.
def t_HEXADECIMAL_LITERAL(t):
    r'0[xX][0-9a-fA-F]+'
    t.value = int(t.value[2:], 16)
    t.type = 'INTEGER'
    return t

def t_OCTAL_LITERAL(t):
    r'0[0-7]+'
    t.value = int(t.value[1:], 8)
    t.type = 'INTEGER'
    return t

def t_DECIMAL_LITERAL(t):
    r'[0-9]+'
    t.value = int(t.value, 10)
    t.type = 'INTEGER'
    return t
Explicitly setting t.type is the correct solution. If you find it redundant, you could refactor into a conversion function:
def send_int(t, offset, base):
    t.value = int(t.value[offset:], base)
    t.type = 'INTEGER'
    return t

def t_HEXADECIMAL_LITERAL(t):
    r'0[xX][0-9a-fA-F]+'
    return send_int(t, 2, 16)

# etc.
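Spelled out for all four bases, the refactoring could look something like the sketch below. It leaves out the ply.lex.lex() call so that it runs without PLY installed. The fake_token helper is a hypothetical stand-in for the token object PLY would pass in. The hexadecimal rule is placed before the decimal one on the assumption that PLY tries function rules in definition order, so that r'[0-9]+' does not consume the leading 0 of 0x10.

```python
from types import SimpleNamespace

tokens = ('INTEGER',)  # the only integer token type the parser sees


def send_int(t, offset, base):
    """Strip the prefix, convert to int, and force the token type to INTEGER."""
    t.value = int(t.value[offset:], base)
    t.type = 'INTEGER'
    return t


def t_BINARY_LITERAL(t):
    r'0[bB][01]+'
    return send_int(t, 2, 2)


def t_HEXADECIMAL_LITERAL(t):
    r'0[xX][0-9a-fA-F]+'
    return send_int(t, 2, 16)


def t_OCTAL_LITERAL(t):
    r'0[0-7]+'
    return send_int(t, 1, 8)


def t_DECIMAL_LITERAL(t):
    r'[0-9]+'
    return send_int(t, 0, 10)


def fake_token(text):
    """Hypothetical stand-in for a PLY token: just .type and .value attributes."""
    return SimpleNamespace(type=None, value=text)


tok = t_HEXADECIMAL_LITERAL(fake_token('0x10'))
print(tok.type, tok.value)  # prints: INTEGER 16
```

Whichever rule matches, the parser only ever sees INTEGER tokens whose value is already a Python int.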