I am trying to rebuild a line of Python code after changing some of its tokens, using Python's tokenize module. A plain tokenize/untokenize round trip does not reproduce the original code; it adds extra spaces to the output.
Is there a bug in untokenize, or am I missing something?
from tokenize import tokenize, untokenize
from io import BytesIO

def retoken(text):
    result = []
    g = tokenize(BytesIO(text.encode('utf-8')).readline)  # tokenize the string
    for toknum, tokval, _, _, _ in g:
        result.append((toknum, tokval))  # keep only the token type and string
    return untokenize(result).decode('utf-8')

code = "x.y=12"
print("CODE:", code)
print("RETOKEN:", retoken(code))
Output:
CODE: x.y=12
RETOKEN: x .y =12
The documentation for untokenize states that
[...] the spacing between tokens (column positions) may change.
I suspect that untokenize doesn't examine its entire list of tokens when building its output string. It appears to add a space after each identifier token, since a space is guaranteed not to be part of the preceding identifier, not to be part of whatever token follows it, and not to be a token itself. This helps ensure that tokenize(untokenize(tokenize(s))) == tokenize(s) (comparing token types and strings), even though untokenize(tokenize(s)) may not equal s.
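To check this, here is a minimal sketch that compares two ways of calling untokenize. It assumes CPython's behavior as I understand it: when untokenize receives only (type, string) pairs it has no column positions to work with and falls back to inserting spaces, whereas full tokens let it reuse the original columns. The helper names (tokens_of, retoken_pairs, retoken_full) and the replace_number parameter are made up for this example; they are not part of the tokenize module.

```python
from io import BytesIO
from tokenize import NUMBER, tokenize, untokenize


def tokens_of(text):
    """Return the full 5-field tokens for a source string."""
    return list(tokenize(BytesIO(text.encode("utf-8")).readline))


def type_string_pairs(text):
    """Return only the (type, string) part of each token."""
    return [(tok.type, tok.string) for tok in tokens_of(text)]


def retoken_pairs(text):
    """Rebuild from (type, string) pairs only; spacing may change."""
    return untokenize(type_string_pairs(text)).decode("utf-8")


def retoken_full(text, replace_number=None):
    """Rebuild from full tokens so the original column positions are reused.

    replace_number is a hypothetical hook for this sketch: if given,
    the string of every NUMBER token is swapped for it.
    """
    result = []
    for tok in tokens_of(text):
        if replace_number is not None and tok.type == NUMBER:
            # Keep the original start/end positions, change only the string.
            tok = tok._replace(string=replace_number)
        result.append(tok)
    return untokenize(result).decode("utf-8")


code = "x.y=12\n"
print("PAIRS:", repr(retoken_pairs(code)))
print("FULL: ", repr(retoken_full(code)))
print("EDIT: ", repr(retoken_full(code, replace_number="99")))
# The pairs-only output still tokenizes to the same types and strings.
print("SAME TOKENS:", type_string_pairs(retoken_pairs(code)) == type_string_pairs(code))
```

The PAIRS line should show the extra spaces from the question, while FULL and EDIT keep the original spacing. Note that the edited token keeps its old positions, so this trick is safest when the tokens that follow on the same line do not depend on the replacement's length.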