
How to use tokenize/untokenize?


I am attempting to rebuild a line of Python code after changing some elements, using Python's tokenize module. A simple tokenize/untokenize round trip does not rebuild the original code; it adds extra spaces to the output.

Is there a bug in untokenize or am I missing something?

from tokenize import tokenize, untokenize
from io import BytesIO


def retoken(text):
    result = []
    g = tokenize(BytesIO(text.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        result.append((toknum, tokval))  # keep only (type, string); drop the positions
    return untokenize(result).decode('utf-8')


code = "x.y=12"
print("CODE:", code)
print("RETOKEN:", retoken(code))

Output:

CODE: x.y=12  
RETOKEN: x .y =12

Solution

  • The documentation for untokenize states that

    [...] the spacing between tokens (column positions) may change.

    I suspect that untokenize doesn't examine its entire list of tokens when building its output string. It appears to add a space after an identifier token, since that character is guaranteed not to be part of the preceding identifier token, not to be part of whatever token follows it, and not to be a token itself. This ensures that tokenize(untokenize(tokenize(s))) == tokenize(s), even though untokenize(tokenize(s)) may not equal s.
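
You can check the documented guarantee directly, and if you need the original spacing back (and you haven't modified any tokens, so their recorded positions still match the text), you can feed the full five-element tokens to untokenize instead of plain (type, string) pairs; it then uses the start/end positions to reconstruct the spacing. A minimal sketch, assuming Python 3 and the same x.y=12 example (token_pairs and retoken_exact are names invented here, not from the question):

from tokenize import tokenize, untokenize
from io import BytesIO


def token_pairs(text):
    # (type, string) pairs only -- the form used in the question, which drops column info
    return [(tok.type, tok.string)
            for tok in tokenize(BytesIO(text.encode('utf-8')).readline)]


def retoken_exact(text):
    # Pass the full 5-tuples straight through so untokenize can use the
    # recorded start/end positions to reproduce the original spacing.
    g = tokenize(BytesIO(text.encode('utf-8')).readline)
    return untokenize(g).decode('utf-8')


code = "x.y=12"
rebuilt = untokenize(token_pairs(code)).decode('utf-8')

print(repr(rebuilt))                               # spacing differs, e.g. 'x .y =12 '
print(token_pairs(rebuilt) == token_pairs(code))   # True: the token stream is preserved
print(retoken_exact(code))                         # x.y=12

Once you start replacing tokens, though, the stored positions no longer line up with the new strings, so the positional reconstruction is only reliable for an unmodified token stream; for a modified stream you are back to the (type, string) form and the extra spaces.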