Tags: python, lexical-analysis, ply

How to define two tokens as one token?


I am trying to define two words separated by a space as one token in my lexical analyzer. When I pass an input like in out, the lexer produces LexToken(KEYIN,'in',1,0) and LexToken(KEYOUT,'out',1,3), but I need it to produce LexToken(KEYINOUT,'in out',1,0).

PS: KEYIN and KEYOUT are also two separate tokens in the grammar's definition.

Following is the test which causes the problem:

import lex
reserved = {'in': 'KEYIN', 'out': 'KEYOUT', 'in\sout': 'KEYINOUT'} # the problem is in here

tokens = ['PLUS', 'MINUS', 'IDENTIFIER'] + list(reserved.values())

t_MINUS = r'-'
t_PLUS = r'\+'
t_ignore = ' \t'

def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t


def t_error(t):
    print("Illegal character '%s'" % t.value[0], "at line", t.lexer.lineno, "at position", t.lexer.lexpos)
    t.lexer.skip(1)


lex.lex()
lex.input("in out inout + - ")
while True:
    tok = lex.token()
    print(tok)
    if not tok:
        break

Output:

LexToken(KEYIN,'in',1,0)
LexToken(KEYOUT,'out',1,3)
LexToken(IDENTIFIER,'inout',1,7)
LexToken(PLUS,'+',1,13)
LexToken(MINUS,'-',1,15)
None

Solution

  • This is your function which recognizes IDENTIFIERs and keywords:

    def t_IDENTIFIER(t):
        r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
        t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
        return t
    

    First, it is clear that the keywords it can recognize are precisely the keys of the dictionary reserved, which are:

    in
    out
    in\sout
    

    Since in out is not a key in that dictionary (in\sout is not the same string), it cannot be recognised as a keyword no matter what t.value happens to be.

    But t.value cannot be in out either, because t.value will always match the regular expression which controls t_IDENTIFIER:

    [a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*
    

    and that regular expression never matches anything containing a space character. (That regular expression has other problems as well: inside the second character class, the characters *, (, ), | and + are treated as ordinary characters rather than as operators. See below for a correct regex.)
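
Both points can be checked with a short sketch that uses only Python's `re` module rather than Ply. The dictionary and pattern are copied from the question (the key is written as a raw string here); the name `IDENT_PATTERN` and the sample string `a(b)*|+` are made up for illustration.

```python
import re

# Keys copied from the question; the raw string keeps the backslash literal.
reserved = {'in': 'KEYIN', 'out': 'KEYOUT', r'in\sout': 'KEYINOUT'}

# The dictionary lookup is a plain string comparison, not a regex match.
print('in out' in reserved)                    # False
print(reserved.get('in out', 'IDENTIFIER'))    # IDENTIFIER

# The identifier pattern from the question (IDENT_PATTERN is an illustrative name).
IDENT_PATTERN = re.compile(r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*')

# The pattern cannot consume a space, so the match stops after "in".
print(IDENT_PATTERN.match('in out').group())   # in

# Inside [...] the characters ( ) * | + are literals, so this odd string
# is accepted as a single "identifier".
print(IDENT_PATTERN.fullmatch('a(b)*|+') is not None)   # True
```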

    You could certainly match in out as a token in a manner similar to that suggested in your original question, prior to the edit. However,

    t_KEYINOUT = r'in\sout'
    

    will not work, because Ply does not use the common "maximum munch" (longest match) algorithm for deciding which regular expression pattern to accept. Instead, it orders all of the patterns and picks the first one which matches. The order consists of all of the tokenizing functions (in the order in which they are defined), followed by the token variables sorted in decreasing order of regex length. Since t_IDENTIFIER is a function, it will be tried before the variable t_KEYINOUT, and it will match in first. To ensure that t_KEYINOUT is tried first, it must be made into a function and placed before t_IDENTIFIER.
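
The effect of "first pattern wins" can be illustrated with a rough model that uses only Python's `re` module and not Ply itself: alternatives in a Python regex are tried left to right, and the first one that matches is accepted even if a later one would match more text. The helper `first_match` and the rule lists are hypothetical names for this sketch.

```python
import re

IDENTIFIER = r'[a-zA-Z][a-zA-Z0-9_]*'
KEYINOUT = r'in\sout'

def first_match(rules, text):
    """Return (rule name, matched text) for the first rule that matches at position 0."""
    master = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in rules))
    m = master.match(text)
    return (m.lastgroup, m.group()) if m else None

# Identifier rule first: models a function rule being tried before a variable rule.
print(first_match([('IDENTIFIER', IDENTIFIER), ('KEYINOUT', KEYINOUT)], 'in out'))
# ('IDENTIFIER', 'in')

# KEYINOUT first: models a t_KEYINOUT function defined above t_IDENTIFIER.
print(first_match([('KEYINOUT', KEYINOUT), ('IDENTIFIER', IDENTIFIER)], 'in out'))
# ('KEYINOUT', 'in out')
```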

    However, that is still not exactly what you want, since it will tokenize

    in outwards
    

    as

    LexToken(KEYINOUT,'in out',1,0)
    LexToken(IDENTIFIER,'wards',1,6)
    

    rather than

    LexToken(KEYIN,'in',1,0)
    LexToken(IDENTIFIER,'outwards',1,3)
    

    To get the correct analysis, you need to ensure that in out only matches if out is a complete word; in other words, if there is a word boundary at the end of the match. So one solution is:

    reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}
    
    def t_KEYINOUT(t):
        r'in\sout\b'
        return t
    
    def t_IDENTIFIER(t):
        r'[a-zA-Z][a-zA-Z0-9_]*'
        t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
        return t
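
To see what the word boundary buys, here is a minimal stand-in tokenizer written with Python's `re` module only. It is a sketch under the assumption that rules are tried in the listed order and that spaces and tabs are skipped, as in the question's lexer; `RULES`, `MASTER` and `tokenize` are hypothetical names, not part of Ply.

```python
import re

# Rules in priority order, mirroring the two functions above.
RULES = [
    ('KEYINOUT', r'in\sout\b'),
    ('IDENTIFIER', r'[a-zA-Z][a-zA-Z0-9_]*'),
]
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}
MASTER = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in RULES))

def tokenize(text):
    """Return a list of (type, value, position) tuples."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in ' \t':          # same role as t_ignore
            pos += 1
            continue
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f'Illegal character {text[pos]!r} at position {pos}')
        kind, value = m.lastgroup, m.group()
        if kind == 'IDENTIFIER':
            kind = reserved.get(value, 'IDENTIFIER')   # check for reserved words
        tokens.append((kind, value, pos))
        pos = m.end()
    return tokens

print(tokenize('in out inout'))
# [('KEYINOUT', 'in out', 0), ('IDENTIFIER', 'inout', 7)]
print(tokenize('in outwards'))
# [('KEYIN', 'in', 0), ('IDENTIFIER', 'outwards', 3)]
```

Without the trailing `\b`, the second call would instead yield `in out` followed by `wards`, which is the wrong analysis described above.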
    

    However, it is almost certainly not necessary for the lexer to recognize in out as a single token. Since both in and out are keywords, it is easy to leave it to the parser to notice when they are used together as an in out designator:

    parameter: KEYIN IDENTIFIER
             | KEYOUT IDENTIFIER
             | KEYIN KEYOUT IDENTIFIER
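
As a rough illustration of letting the parser do this work, the three productions can be modelled in plain Python as a lookup on the sequence of token types. This is only a sketch, not a Ply parser: `PRODUCTIONS` and `parse_parameter` are hypothetical names, and the tokens are assumed to arrive as (type, value) pairs from a lexer that keeps KEYIN and KEYOUT separate.

```python
# parameter : KEYIN IDENTIFIER
#           | KEYOUT IDENTIFIER
#           | KEYIN KEYOUT IDENTIFIER
PRODUCTIONS = {
    ('KEYIN', 'IDENTIFIER'): 'in',
    ('KEYOUT', 'IDENTIFIER'): 'out',
    ('KEYIN', 'KEYOUT', 'IDENTIFIER'): 'in out',
}

def parse_parameter(tokens):
    """Return (mode, name) for a list of (type, value) token pairs."""
    types = tuple(tok_type for tok_type, _ in tokens)
    if types not in PRODUCTIONS:
        raise SyntaxError(f'unexpected token sequence: {types}')
    return PRODUCTIONS[types], tokens[-1][1]

print(parse_parameter([('KEYIN', 'in'), ('KEYOUT', 'out'), ('IDENTIFIER', 'x')]))
# ('in out', 'x')
print(parse_parameter([('KEYOUT', 'out'), ('IDENTIFIER', 'y')]))
# ('out', 'y')
```

The lexer stays simple this way, and any amount of whitespace between in and out is handled for free because it is skipped before the parser ever sees the tokens.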