I'm trying to go through all the string literals in Python source code while being able to tell what kind of string literal each one is.
Unfortunately, as you can see in this example, ast.parse doesn't work:
[node.value.s for node in ast.parse('\'x\'; u\'x\'; b\'x\'; "x"; u"x"; b"x"').body]
The output is:
['x', 'x', b'x', 'x', 'x', b'x']
meaning that I can't distinguish between the '' and u'' literals, or between the '' and "" literals, etc.
How can I parse Python source code while maintaining the original literal exactly as written?
Is there a built-in way?
The information you're looking for isn't AST-level information. The appropriate level to inspect stuff like this is the token level, and you can use the tokenize module for that.
The tokenize API is pretty awkward - it wants an input that behaves like the readline method of a binary file-like object - so you'll need to open files in binary mode, and if you have a string, you'll need to use encode and io.BytesIO for conversion.
import tokenize
token_stream = tokenize.tokenize(input_file.readline)
for token in token_stream:
    if token.type == tokenize.STRING:
        do_whatever_with(token.string)
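For example, here's a minimal sketch of the string case, applying this to the literals from the question: the source is encoded, wrapped in io.BytesIO, and each STRING token's string attribute holds the literal exactly as written, prefixes and quote style included.

```python
import io
import tokenize

# Tokenizing source held in a str requires encoding it and wrapping it
# in a binary file-like object, since tokenize.tokenize() wants a
# readline method that yields bytes.
source = "'x'; u'x'; b'x'; \"x\"; u\"x\"; b\"x\""
readline = io.BytesIO(source.encode('utf-8')).readline

# Keep only STRING tokens; tok.string is the exact source text.
literals = [tok.string for tok in tokenize.tokenize(readline)
            if tok.type == tokenize.STRING]
print(literals)  # → ["'x'", "u'x'", "b'x'", '"x"', 'u"x"', 'b"x"']
```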
Here's the Python 2 version - the function names are different, and you have to access token information positionally, because you get regular tuples instead of namedtuples:
import tokenize
token_stream = tokenize.generate_tokens(input_file.readline)
for token_type, token_string, _, _, _ in token_stream:
    if token_type == tokenize.STRING:
        do_whatever_with(token_string)
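As an aside, tokenize.generate_tokens also exists on Python 3, where it takes a readline that yields str rather than bytes - so if you already have the source as a string, you can sidestep the binary requirement with io.StringIO. A sketch:

```python
import io
import tokenize

source = "'x'; u'x'"

# On Python 3, generate_tokens accepts a text-mode readline, so no
# encode/BytesIO dance is needed for string input.
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
strings = [tok.string for tok in tokens if tok.type == tokenize.STRING]
print(strings)  # → ["'x'", "u'x'"] - the u prefix is preserved
```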