How to organize multiple functions returning same token with PLY?


Let's say I would like to have just one PLY token - 'INTEGER'. However, I would like to be able to parse typical C-style literals in different bases, that is, strings like 0b10 (or 0B10), 010, 10 and 0x10 (or 0X10). As I don't really care what the "input format" was, I would like to just have the value as an int in Python.

However, handling all 4 of these cases in a single function is not especially convenient... First of all, the regex becomes pretty long: r'0[0-7]+|0[bB][01]+|0[xX][0-9a-fA-F]+|[0-9]+'. But this is the smaller issue - the code of the function has to distinguish several combinations to know which base to use. A string starting with 0 may be just that single character, so checking the further cases (next character is x, X, b or B) also has to take the length into account.
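To make the concern concrete, here is a minimal sketch of the branching such a single rule would need. The helper name `c_int_value` is hypothetical, and the sketch assumes the text is a well-formed literal; something like 08 would raise ValueError.

```python
def c_int_value(text):
    """Convert a C-style integer literal to int (hypothetical helper)."""
    # Check the length first: a lone "0" has no second character to inspect.
    if len(text) > 1 and text[0] == '0':
        if text[1] in 'xX':
            return int(text[2:], 16)   # 0x10 / 0X10
        if text[1] in 'bB':
            return int(text[2:], 2)    # 0b10 / 0B10
        return int(text[1:], 8)        # 010 (leading zero means octal)
    return int(text, 10)               # 10, or a single "0"
```

Inside a single PLY rule, the body would then reduce to `t.value = c_int_value(t.value)`.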

So I would just prefer to have that as 4 separate functions, but all returning the same 'INTEGER' type of token. I would prefer to not introduce BINARY_INTEGER, OCTAL_INTEGER, DECIMAL_INTEGER and HEXADECIMAL_INTEGER, because this would needlessly complicate the parser (or maybe I'm overthinking that?).

I was wondering whether there's something smarter to do than just forcing token.type to be 'INTEGER' in four "free" functions? Something other than:

def t_BINARY_LITERAL(t):
    r'0[bB][01]+'
    t.value = int(t.value[2:], 2)
    t.type = 'INTEGER'
    return t

def t_OCTAL_LITERAL(t):
    r'0[0-7]+'
    t.value = int(t.value[1:], 8)
    t.type = 'INTEGER'
    return t

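# Defined before t_DECIMAL_LITERAL on purpose: PLY tries function rules in
# definition order, and '[0-9]+' would otherwise match the "0" of "0x10".
def t_HEXADECIMAL_LITERAL(t):
    r'0[xX][0-9a-fA-F]+'
    t.value = int(t.value[2:], 16)
    t.type = 'INTEGER'
    return t

def t_DECIMAL_LITERAL(t):
    r'[0-9]+'
    t.value = int(t.value, 10)
    t.type = 'INTEGER'
    return t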

Solution

  • Explicitly setting t.type is the correct solution. If you find it redundant, you could refactor into a conversion function:

    def send_int(t, offset, base):
      t.value = int(t.value[offset:], base)
      t.type  = 'INTEGER'
      return t
    
    def t_HEXADECIMAL_LITERAL(t):
      r'0[xX][0-9a-fA-F]+'
      return send_int(t, 2, 16)
    
    # etc.
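For reference, the helper above might be filled out into a complete lexer along these lines. This is a sketch, not a definitive implementation: `build_lexer`, `lex_values`, the `t_ignore` setting and the `t_error` handler are illustrative additions, not part of the original answer. One detail worth noting is that PLY tries function rules in the order they are defined, so the prefixed forms are placed before the plain decimal rule, which would otherwise match the leading 0 of 0x10.

```python
tokens = ('INTEGER',)

t_ignore = ' \t'  # assumption: literals are separated by blanks


def send_int(t, offset, base):
    # Strip the prefix (if any), convert, and force the shared token type.
    t.value = int(t.value[offset:], base)
    t.type = 'INTEGER'
    return t


# Prefixed forms first; the decimal rule goes last (see note above).
def t_HEXADECIMAL_LITERAL(t):
    r'0[xX][0-9a-fA-F]+'
    return send_int(t, 2, 16)


def t_BINARY_LITERAL(t):
    r'0[bB][01]+'
    return send_int(t, 2, 2)


def t_OCTAL_LITERAL(t):
    r'0[0-7]+'
    return send_int(t, 1, 8)


def t_DECIMAL_LITERAL(t):
    r'[0-9]+'
    return send_int(t, 0, 10)


def t_error(t):
    # Report and skip any character that no rule matches.
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)


def build_lexer():
    # PLY is third-party (pip install ply). It is imported here so the
    # rule functions above can also be exercised without it.
    import ply.lex as lex
    return lex.lex()


def lex_values(text):
    # Hypothetical convenience wrapper: tokenize text into (type, value) pairs.
    lexer = build_lexer()
    lexer.input(text)
    return [(tok.type, tok.value) for tok in lexer]
```

With PLY installed, `lex_values("0b10 010 10 0x10")` should yield four INTEGER tokens carrying the values 2, 8, 10 and 16, so the parser only ever sees a single token type.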