Tags: python, python-2.7, debugging, unicode, monkeypatching

Tracking down implicit unicode conversions in Python 2


I have a large project where, in various places, problematic implicit Unicode conversions (coercions) were used, e.g. in the form of:

someDynamicStr = "bar" # could come from various sources

# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)

someDynamicStr = "\xff" # uh-oh

# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)

(Possibly other forms as well.)
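Under the hood, the coercion decodes the byte string with the interpreter-wide default encoding, so the failing lines above are effectively shorthand for the following (a minimal illustration, assuming a stock CPython 2 where the default is ASCII):

import sys

someDynamicStr = "\xff"
# what u"foo" + someDynamicStr does behind the scenes:
u"foo" + someDynamicStr.decode(sys.getdefaultencoding())  # raises UnicodeDecodeError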

Now I would like to track down those usages, especially those in actively used code.

It would be great if I could easily replace the unicode constructor with a wrapper that checks whether the input is of type str and whether the encoding/errors parameters are set to their default values, and then notifies me (prints a traceback or the like).

/edit:

While not directly related to what I am looking for, I came across this gloriously horrible hack for making the decode exception go away altogether (the decode direction only, i.e. str to unicode, not the other way around; see https://mail.python.org/pipermail/python-list/2012-July/627506.html).

I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):

import codecs
# make the "strict" error handler silently drop undecodable data:
codecs.register_error("strict", codecs.ignore_errors)
# alternatively, substitute an empty string and skip past the bad bytes:
codecs.register_error("strict", lambda exc: (u"", exc.end))

(An internet search for codecs.register_error("strict" revealed that it is apparently used in some real projects.)

/edit #2:

For explicit conversions I made a snippet with the help of a SO post on monkeypatching:

class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        # these encoding values are all just aliases of ASCII, the default:
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode
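A quick test (hypothetical interpreter session) illustrates the limitation:

>>> unicode("bar")   # explicit call goes through the patched class
Problematic unicode() usage detected!
u'bar'
>>> u"foo" + "bar"   # implicit coercion bypasses it entirely
u'foobar'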

This only affects explicit conversions using the unicode() constructor directly so it's not something I need.

/edit #3:

The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).

/edit #4:

It's nice to see many good answers here, too bad I can only give out the bounty once.

In the meantime I came across a somewhat similar question, at least in terms of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs? Note, though, that throwing an exception would not have been OK in my case. Here I was looking for something that might point me to the various locations of problematic code (e.g. by printing something), but not something that might exit the program or change its behavior (because this way I can prioritize what to fix).
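(For completeness: the exception-raising behavior that question asks about can be had with the stdlib's 'undefined' codec, which exists precisely to switch off automatic coercion; a minimal sketch, not what I want here since it changes program behavior:)

import sys
reload(sys)  # site.py deletes setdefaultencoding(); reload(sys) restores it
sys.setdefaultencoding('undefined')  # every implicit str<->unicode conversion now raises UnicodeError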

On another note, the people working on the Mypy project (who include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208.

/edit #5:

I also came across the following but haven't yet had the time to test it: https://pypi.python.org/pypi/unicode-nazi


Solution

  • You can register a custom encoding which prints a message whenever it's used:

    Code in ourencoding.py:

    import sys
    import codecs
    import traceback
    
    # Define a function to print out a stack frame and a message:
    
    def printWarning(s):
        sys.stderr.write(s)
        sys.stderr.write("\n")
        l = traceback.extract_stack()
        # cut off the frames pointing to printWarning and our_encode
        l = traceback.format_list(l[:-2])
        sys.stderr.write("".join(l))
    
    # Define our encoding:
    
    originalencoding = sys.getdefaultencoding()
    
    def our_encode(s, errors='strict'):
        printWarning("Default encoding used")
        return (codecs.encode(s, originalencoding, errors), len(s))
    
    def our_decode(s, errors='strict'):
        printWarning("Default encoding used")
        return (codecs.decode(s, originalencoding, errors), len(s))
    
    def our_search(name):
        if name == 'our_encoding':
            return codecs.CodecInfo(
                name='our_encoding',
                encode=our_encode,
                decode=our_decode)
        return None
    
    # register our search function and set the default encoding:
    codecs.register(our_search)
    # reload(sys) restores setdefaultencoding(), which site.py deletes on startup:
    reload(sys)
    sys.setdefaultencoding('our_encoding')
    

    If you import this file at the start of your script, then you'll see warnings for implicit conversions:

    #!python2
    # coding: utf-8
    
    import ourencoding
    
    print("test 1")
    a = "hello " + u"world"
    
    print("test 2")
    a = "hello ☺ " + u"world"
    
    print("test 3")
    b = u" ".join(["hello", u"☺"])
    
    print("test 4")
    c = unicode("hello ☺")
    

    output:

    test 1
    test 2
    Default encoding used
     File "test.py", line 10, in <module>
       a = "hello ☺ " + u"world"
    test 3
    Default encoding used
     File "test.py", line 13, in <module>
       b = u" ".join(["hello", u"☺"])
    test 4
    Default encoding used
     File "test.py", line 16, in <module>
       c = unicode("hello ☺")
    

    It's not perfect, as test 1 shows: if the converted string contains only ASCII characters, you sometimes won't see a warning.
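    As a sanity check that the codec really got registered (my addition, not part of the recipe above), you can invoke it explicitly; this should print the warning plus a short stack trace before the result:

    import ourencoding
    print(repr("hello".decode("our_encoding")))  # warning + trace, then u'hello'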