python testing string-literals python-2to3

Find/Test for unadorned string literals (no b" or u") in Python


As part of an effort to write code that works consistently on both Python 2 and 3, I would like to test for any unadorned string literals (any opening " or ' not preceded by a b or u).

I'm fine with writing test cases, so I just need a function that returns all unadorned string literals across my .py files.

As an example, say I have Python code containing the following:

example_byte_string = b'This is a string of ASCII text or bytes'

example_unicode_string = u"This is a Unicode string"

example_unadorned_string = 'This string was not marked either way and would be treated as bytes in Python 2, but Unicode in Python 3'

example_unadorned_string2 = "This is what they call a 'string'!"

example_unadorned_string3 = 'John said "Is it really?" very loudly'

I want to find all of the strings that are not explicitly marked, like example_unadorned_string, so that I can mark them properly and make them behave the same way under Python 2 and 3. The search should also cope with quotes inside strings, as in example_unadorned_string2 and example_unadorned_string3, because the internal quotes must not get a u or b added. Long term we will drop Python 2 support, and then only bytes will need explicit marking. This matches the approach recommended by python-future.org: http://python-future.org/automatic_conversion.html#separating-text-from-bytes

I can think of ways to do this with grep, but they are pretty nasty. The ast module looks potentially helpful, too. But I feel like somebody must have solved this problem already, so I thought I'd ask.


Solution

  • You might want to explore the tokenize module, which exists in both Python 2 and Python 3. A rough Python 3 example would be something like this:

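    import tokenize
    import token

    def iter_unadorned_strings(f):
        """Yield every STRING token whose text starts directly with a quote."""
        # tokenize.tokenize() needs the readline method of a file opened in binary mode.
        for t in tokenize.tokenize(f.readline):
            # Any prefix (b, u, r, ...) comes before the opening quote, so a literal
            # that begins with a quote character has no prefix at all.
            if t.type == token.STRING and t.string[0] in ('"', "'"):
                yield t

    if __name__ == '__main__':
        fname = r'code_file.py'
        with open(fname, 'rb') as f:
            for s in iter_unadorned_strings(f):
                # start and end are (row, column) pairs
                print(s.start, s.end, s.string)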
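The example above reads a single file and checks only the first character of each literal, so a raw literal such as r'...' is skipped even though it carries no b or u. Since the question asks for something that covers all .py files, here is a minimal sketch of one way to extend the idea. The helper names string_prefix, find_unadorned_strings and scan_tree are made up for illustration and are not part of tokenize. The sketch assumes Python 3 and assumes that every literal whose prefix contains neither b nor u should be reported, which includes docstrings.

```python
import os
import tokenize


def string_prefix(literal):
    """Return the lowercase prefix letters that precede the opening quote."""
    for i, ch in enumerate(literal):
        if ch in "\"'":
            return literal[:i].lower()
    return ""


def find_unadorned_strings(path):
    """Yield (path, (row, col), text) for each literal without a b or u prefix."""
    # tokenize.tokenize() wants bytes, so the file is opened in binary mode.
    with open(path, "rb") as f:
        for tok in tokenize.tokenize(f.readline):
            if tok.type != tokenize.STRING:
                continue
            prefix = string_prefix(tok.string)
            # Assumption: r'...' counts as unadorned, while br'...' and ur'...' do not.
            if "b" not in prefix and "u" not in prefix:
                yield path, tok.start, tok.string


def scan_tree(root):
    """Collect unadorned literals from every .py file under root (hypothetical helper)."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".py"):
                results.extend(find_unadorned_strings(os.path.join(dirpath, name)))
    return results
```

A test case could then assert that scan_tree('path/to/package') returns an empty list. One caveat: on Python 3.12 and later, f-strings are no longer reported as STRING tokens, so this sketch would not see them; code that still has to run on Python 2 cannot contain f-strings anyway.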