python unicode unicode-literals

How to ensure all string literals are unicode in python


I have a fairly large Python code base to go through. Some of its string literals are byte strings (str) and others are unicode, and mixing the two causes bugs. I am trying to convert everything to unicode, and I was wondering if there is a tool that can convert all literals to unicode, i.e. if it found something like this:

print "result code %d" % result['code']

it would change it to:

print u"result code %d" % result[u'code']

If it helps, I use PyCharm (in case there is an extension that does this), but I would be happy to use a command-line tool as well. Hopefully such a tool exists.


Solution

  • You can use tokenize.generate_tokens to break the text of a Python source file into tokens. tokenize also classifies each token for you, so you can identify the string literals in the code.

    It is then not hard to manipulate the tokens, adding 'u' where desired:


    import tokenize
    import token
    import io
    import collections
    
    class Token(collections.namedtuple('Token', 'num val start end line')):
        @property
        def name(self):
            return token.tok_name[self.num]
    
    def change_str_to_unicode(text):    
        result = text.splitlines()
        # Insert a dummy line into result so indexing result
        # matches tokenize's 1-based indexing
        result.insert(0, '')
        changes = []
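        # Python 2: text is a byte string (str), so BytesIO provides readline
        for tok in tokenize.generate_tokens(io.BytesIO(text).readline):
            tok = Token(*tok)
            # Skip literals that already have a u/U or b/B prefix
            if tok.name == 'STRING' and not tok.val.lower().startswith(('u', 'b')):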
                changes.append(tok.start)
    
        for linenum, s in reversed(changes):
            line = result[linenum]
            result[linenum] = line[:s] + 'u' + line[s:]
        return '\n'.join(result[1:])
    
    text = '''print "result code %d" % result['code']
    # doesn't touch 'strings' in comments
    'handles multilines' + \
    'okay'
    u'Unicode is not touched'
    '''
    
    print(change_str_to_unicode(text))
    

    yields

    print u"result code %d" % result[u'code']
    # doesn't touch 'strings' in comments
    u'handles multilines' + u'okay'
    u'Unicode is not touched'
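
The listing above is written for a Python 2 interpreter (it feeds a byte string to io.BytesIO). If the conversion script itself has to run under Python 3, the same idea can be sketched roughly as follows. This is only a sketch under a few assumptions: the function name add_unicode_prefix is made up for illustration, the source is passed in as a str, and Python 3's tokenizer may reject some Python 2-only syntax in the files being converted.

```python
import io
import tokenize


def add_unicode_prefix(source):
    """Return source with a u prefix added to string literals that lack one.

    Hypothetical helper for illustration; not part of any library.
    """
    lines = source.splitlines(True)  # keep line endings so the text round-trips
    positions = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        # Everything before the first quote character is the literal's prefix.
        quote_at = min(i for i in (tok.string.find('"'), tok.string.find("'"))
                       if i != -1)
        prefix = tok.string[:quote_at].lower()
        # Leave unicode (u), bytes (b) and f-string (f) literals alone.
        if any(ch in prefix for ch in 'ubf'):
            continue
        positions.append(tok.start)  # (row, col); rows are 1-based

    # Work from the last literal to the first so earlier columns stay valid.
    for row, col in reversed(positions):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + 'u' + line[col:]
    return ''.join(lines)


sample = 'print "result code %d" % result[\'code\']\n'
print(add_unicode_prefix(sample), end='')
```

Note that a raw literal such as r'\d' becomes ur'\d', which is valid in Python 2 (the target here) but not in Python 3. To convert a file, read its contents, pass them through the function, and write the result back; keeping the files under version control makes it easy to review the diff first.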