As part of an effort to write code that works consistently on both Python 2 and 3, I would like to test for any unadorned string literals (any opening " or ' not preceded by a b or u).
I'm fine with writing test cases, so I just need a function that returns all unadorned string literals across my .py files.
As an example, say I have Python code containing the following:
example_byte_string = b'This is a string of ASCII text or bytes'
example_unicode_string = u"This is a Unicode string"
example_unadorned_string = 'This string was not marked either way and would be treated as bytes in Python 2, but Unicode in Python 3'
example_unadorned_string2 = "This is what they call a 'string'!"
example_unadorned_string3 = 'John said "Is it really?" very loudly'
I want to find all of the strings that are not explicitly marked, like example_unadorned_string, so that I can mark them properly and make them behave the same way under Python 2 and 3. It would also be good to accommodate quotes within strings, as in example_unadorned_string2 and example_unadorned_string3; the internal quotes should not have u/b added to them. Obviously, long term, we will drop Python 2 support and only byte strings will need explicit marking. This aligns with the approach recommended by python-future.org: http://python-future.org/automatic_conversion.html#separating-text-from-bytes
I can think of ways to do this with grep that are pretty nasty. AST looks potentially helpful, too. But I feel like somebody must have already solved this problem before, so thought I'd ask.
You might want to explore the tokenize module (python2, python3): it gives you each string literal's exact source text and position, so you can inspect the prefix (or lack of one) directly. A rough Python 3 example would be something like this:
import tokenize
import token

def iter_unadorned_strings(f):
    # tokenize.tokenize() wants the readline method of a binary-mode file,
    # which is why the file is opened in 'rb' mode below.
    tokens = tokenize.tokenize(f.readline)
    for t in tokens:
        # A prefixed literal (b'...', u"...") starts with its prefix letter,
        # so an unadorned literal is one whose first character is a quote.
        if t.type == token.STRING and t.string[0] in ['"', "'"]:
            yield t

fname = r'code_file.py'
if __name__ == '__main__':
    with open(fname, 'rb') as f:
        for s in iter_unadorned_strings(f):
            print(s.start, s.end, s.string)
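Since you want to check every .py file, you can wrap that generator in a small driver. A minimal sketch, assuming your sources live under a directory such as src (the path is just a placeholder):

from pathlib import Path

src_root = Path('src')  # placeholder: point this at your source tree
for path in sorted(src_root.rglob('*.py')):
    with open(path, 'rb') as f:
        for s in iter_unadorned_strings(f):
            print(path, s.start, s.end, s.string)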
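Because quotes inside a string are part of that STRING token rather than tokens of their own, the token's start position always points at the literal's own opening quote, which is exactly where a prefix belongs. If you want to add the prefixes automatically rather than just report them, something like the following sketch would work (the helper name, the in-place rewrite, and the default prefix are my own choices, not an established API):

import tokenize
import token

def add_prefix_to_unadorned(path, prefix='u'):
    # Sketch only: this rewrites the file in place, so run it on a copy
    # or under version control.
    with open(path, 'rb') as f:
        tokens = list(tokenize.tokenize(f.readline))
    encoding = tokens[0].string  # the first token is always ENCODING
    with open(path, encoding=encoding) as f:
        lines = f.readlines()
    # Walk the tokens in reverse so insertions never shift the (row, col)
    # positions of tokens that have not been handled yet.
    for t in reversed(tokens):
        if t.type == token.STRING and t.string[0] in ['"', "'"]:
            row, col = t.start  # rows are 1-based, columns 0-based
            lines[row - 1] = lines[row - 1][:col] + prefix + lines[row - 1][col:]
    with open(path, 'w', encoding=encoding) as f:
        f.writelines(lines)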