python parsing abstract-syntax-tree literals string-length

How to parse Python code while keeping string literals exactly as-is?

I'm trying to go through all the string literals in Python source code while being able to tell what kind of string literal each one is.

Unfortunately, as this example shows, ast.parse throws that information away:

import ast
# .value is the current name for the deprecated .s attribute on constant nodes
[node.value.value for node in ast.parse('\'x\'; u\'x\'; b\'x\'; "x"; u"x"; b"x"').body]

The output is:

['x', 'x', b'x', 'x', 'x', b'x']

meaning that I can't distinguish between the '' and u'' literals, or the '' and "", etc.

How can I parse Python source code while maintaining the original literal exactly as written?

Is there a built-in way?

Solution

The information you're looking for isn't AST-level information. The appropriate level to inspect stuff like this is the token level, and you can use the tokenize module for that.

The tokenize API is pretty awkward - tokenize.tokenize wants a callable that behaves like the readline method of a binary file-like object - so you'll need to open files in binary mode, and if you have a string, you'll need to encode it and wrap it in io.BytesIO.

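import tokenize

# input_file must be opened in binary mode, e.g. open(path, 'rb')
token_stream = tokenize.tokenize(input_file.readline)
for token in token_stream:
    if token.type == tokenize.STRING:
        # token.string is the literal exactly as written, prefix and quotes included
        do_whatever_with(token.string)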
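
If the source is already a string rather than a file, the conversion described above might look like the following. This is a minimal sketch assuming Python 3. The helper names string_literals and string_literals_from_text are made up for illustration. The second helper uses tokenize.generate_tokens, which in Python 3 takes a readline that returns str, so no encoding step is needed.

```python
import io
import tokenize


def string_literals(source):
    """Return the raw text of each STRING token in `source` (hypothetical helper)."""
    # tokenize.tokenize needs a readline callable that returns bytes.
    readline = io.BytesIO(source.encode("utf-8")).readline
    return [
        token.string
        for token in tokenize.tokenize(readline)
        if token.type == tokenize.STRING
    ]


def string_literals_from_text(source):
    """Same result via generate_tokens, which reads str lines instead of bytes."""
    readline = io.StringIO(source).readline
    return [
        token.string
        for token in tokenize.generate_tokens(readline)
        if token.type == tokenize.STRING
    ]


# The example from the question.
source = '\'x\'; u\'x\'; b\'x\'; "x"; u"x"; b"x"'
print(string_literals(source))
# → ["'x'", "u'x'", "b'x'", '"x"', 'u"x"', 'b"x"']
```

One caveat: on Python 3.12 and later, f-strings are tokenized as FSTRING_START, FSTRING_MIDDLE and FSTRING_END tokens rather than a single STRING token, so this sketch would not collect them there.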

Here's the Python 2 version - the function names are different, and you have to access token information positionally, because you get regular tuples instead of namedtuples:

import tokenize
token_stream = tokenize.generate_tokens(input_file.readline)
for token_type, token_string, _, _, _ in token_stream:
    if token_type == tokenize.STRING:
        do_whatever_with(token_string)
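
Once you have the raw token text, telling the kinds of literal apart is plain string handling. Here is a rough Python 3 sketch, not a complete classifier: describe_literal is a hypothetical helper that splits a literal into its prefix and its opening quote, assuming its input is a well-formed STRING token as produced by tokenize.

```python
import io
import tokenize


def describe_literal(literal):
    """Split a raw string literal into (prefix, quote) as written in the source."""
    # The prefix is everything before the first quote character: "u", "rb", or "".
    positions = [i for i in (literal.find("'"), literal.find('"')) if i != -1]
    first_quote = min(positions)
    prefix = literal[:first_quote]
    rest = literal[first_quote:]
    # Triple-quoted literals open with three identical quote characters.
    quote = rest[:3] if rest[:3] in ("'''", '"""') else rest[0]
    return prefix, quote


source = '\'x\'; u\'x\'; b\'x\'; "x"; u"x"; b"x"'
readline = io.BytesIO(source.encode("utf-8")).readline
for token in tokenize.tokenize(readline):
    if token.type == tokenize.STRING:
        print(token.string, describe_literal(token.string))
```

Prefixes come back exactly as typed, so 'R', 'Rb' and 'bR' stay distinct. Lowercase the prefix first if you only care about which flags are present.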