
Lexeme evaluation in Python


Wikipedia makes a clear distinction between the concept of a lexeme and the concept of a token:

Lexing can be divided into two stages: the scanning, which segments the input string into syntactic units called lexemes and categorizes these into token classes; and the evaluating, which converts lexemes into processed values.

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser.

As I understand it, this means that a token is the result of mapping a pair <category, lexeme> to a pair <category, value>, where in the Python case the category belongs to the set {identifier, keyword, literal, operator, delimiter, NEWLINE, INDENT, DEDENT}.

I want to understand better what the evaluation of a lexeme means in the case of the lexeme of category literal. Can we see the result of such evaluation by simply typing the lexeme in the Python REPL?

For example, if I type the following string literal lexeme

>>> '''some
... text'''

I get the output 'some\ntext' - can we call this string the value of the above lexeme (note the partial stripping of the quotes and the insertion of the `\n` symbol)?

And if I type the following numeric literal lexeme

>>> 0b101

I get the output 5 - can we call this number the value of the above lexeme?


Solution

  • Well, after some googling I found the official description of Python's tokenize module (unfortunately, there is no link to this page from the official Python reference on Lexical analysis).

    It says the following:

    Example of tokenizing from the command line. The script:
    def say_hello():
        print("Hello, World!")
    
    say_hello()
    

    will be tokenized to the following output, where the first column gives the range of line/column coordinates spanned by the token, the second column gives the name of the token, and the final column gives the value of the token (if any):

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''
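
    For example, the same token stream can be produced programmatically (a minimal sketch; I feed the source in as a string via io.StringIO, and tokenize.generate_tokens, unlike tokenize.tokenize, does not emit the ENCODING token):

    import io
    import tokenize

    source = 'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'

    # Each TokenInfo carries the token type and tok.string, the "token string",
    # i.e. the lexeme exactly as it appears in the source.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(f'{tok.start}-{tok.end}: {tokenize.tok_name[tok.type]:10} {tok.string!r}')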
    

    So I think that in Python (at least in the context of lexical analysis) the token's value should be interpreted as the string representation that the Python tokenizer produces for that token. It is a different concept from the concept of an "object's value" (the latter is abstract and well described by @Ulrich Eckhardt). This string representation is exactly the same as the corresponding lexeme in the source code, and in many cases it differs from the repr of the object (I mean the object that the interpreter associates with the token at program runtime, after parsing and bytecode compilation).
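
    For instance, for the numeric literal from the question, the token string is the lexeme '0b101', while the object created at run time prints as 5 (a small sketch; ast.literal_eval stands in here for the evaluation the interpreter performs later, it is not part of the tokenizer):

    import io
    import tokenize
    import ast

    source = '0b101\n'

    # The first token of the line is NUMBER, and its string is the lexeme itself.
    number_tok = next(tokenize.generate_tokens(io.StringIO(source).readline))
    print(tokenize.tok_name[number_tok.type], repr(number_tok.string))  # NUMBER '0b101'

    # The object the interpreter creates later has a different repr.
    print(repr(ast.literal_eval(source)))  # 5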

    When I say "exactly the same", I mean that there is absolutely no difference between a lexeme in the source code and the value of the token (the element of the token tuple) that the tokenizer generates for that lexeme. In fact, there is even a function tokenize.untokenize which allows an exact reconstruction of the lexeme from its token ("The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured."). The tokenizer even preserves the \ character in a string literal lexeme when it is used for explicit line joining (more about this here) and in all other cases where it appears in string literals.
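
    A minimal round-trip sketch (the assert checks exactly the documented guarantee, namely that the reconstructed source tokenizes back to the same token types and token strings):

    import io
    import tokenize

    source = 'value = """some\ntext"""\n'

    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    rebuilt = tokenize.untokenize(tokens)

    # Documented guarantee: the reconstructed text tokenizes back to the same
    # sequence of (token type, token string) pairs as the original source.
    rebuilt_tokens = list(tokenize.generate_tokens(io.StringIO(rebuilt).readline))
    assert [(t.type, t.string) for t in tokens] == \
           [(t.type, t.string) for t in rebuilt_tokens]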

    In programming languages other than Python, however, the token's value can differ from the original lexeme. In fact, Wikipedia says (describing the concept of a token's value in the general programming context, i.e. not just for Python):

    Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing: only the type is needed. Similarly, sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include some unstropping. The evaluators for integer literals may pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for an escaped string literal incorporates a lexer, which unescapes the escape sequences.

    For example, in the source code of a computer program, the string

    net_worth_future = (assets - liabilities);
    

    might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:

    IDENTIFIER net_worth_future
    EQUALS
    OPEN_PARENTHESIS
    IDENTIFIER assets
    MINUS
    IDENTIFIER liabilities
    CLOSE_PARENTHESIS
    SEMICOLON
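
    As a toy illustration of this scheme (my own sketch in Python, not how CPython's tokenizer works; the token names and regular expressions below are made up to mirror the Wikipedia example), a scanner/evaluator that reproduces exactly that stream could look like this:

    import re

    # Scanner: classify lexemes. Evaluator: whitespace is suppressed, punctuation
    # tokens carry no value, identifiers keep their lexeme as the value.
    TOKEN_SPEC = [
        ("IDENTIFIER",        r"[A-Za-z_]\w*"),
        ("EQUALS",            r"="),
        ("OPEN_PARENTHESIS",  r"\("),
        ("MINUS",             r"-"),
        ("CLOSE_PARENTHESIS", r"\)"),
        ("SEMICOLON",         r";"),
        ("WHITESPACE",        r"\s+"),
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})"
                                    for name, pattern in TOKEN_SPEC))

    def lex(text):
        for match in MASTER_RE.finditer(text):
            kind, lexeme = match.lastgroup, match.group()
            if kind == "WHITESPACE":
                continue                      # suppressed, never reaches the parser
            value = lexeme if kind == "IDENTIFIER" else None
            yield (kind, value)

    for token in lex("net_worth_future = (assets - liabilities);"):
        print(*filter(None, token))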
    

    P.S. "The value of the token" is more often called the "token string" in the description of Python's tokenize module.