UnicodeDecodeError using Django and format-strings


I wrote a small example of the issue, using Python 2.7 and Django 1.10.8, so that everybody can see what's going on:

# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function

import time
from django import setup
setup()
from django.contrib.auth.models import Group

group = Group(name='schön')

print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))

print(group)
print(repr(group))
print(str(group))
print(unicode(group))

time.sleep(1.0)
print('%s' % group)
print('%r' % group)   # fails
print('%s' % [group]) # fails
print('%r' % [group]) # fails

It exits with the following output and traceback:

$ python .PyCharmCE2017.2/config/scratches/scratch.py
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
Traceback (most recent call last):
  File "/home/srkunze/.PyCharmCE2017.2/config/scratches/scratch.py", line 22, in <module>
    print('%r' % group) # fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

Does anybody have an idea what's going on here?


Solution

  • The issue here is that you are interpolating UTF-8 bytestrings into a Unicode string. Your '%r' string is a Unicode string because you used from __future__ import unicode_literals, but repr(group) (used by the %r placeholder) returns a bytestring. When Python 2 mixes the two types, it implicitly decodes the bytestring with the default ASCII codec. For Django models, repr() can include Unicode data in the representation, encoded to a bytestring using UTF-8. Such representations are not ASCII-safe, so the implicit decode fails.

    For your specific example, repr() on your Group instance produces the bytestring '<Group: sch\xc3\xb6n>'. Interpolating that into a Unicode string triggers the implicit decoding:

    >>> u'%s' % '<Group: sch\xc3\xb6n>'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
    

    Note that I did not use from __future__ import unicode_literals in my Python session, so the '<Group: sch\xc3\xb6n>' literal is a str bytestring object, not a unicode object!
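
The failure above can be reproduced without Django or string formatting, since the implicit conversion amounts to decoding the bytestring with the ASCII codec. The following is a minimal sketch, assuming the default encoding has not been changed; raw_repr is a hypothetical stand-in for repr(group), and failure is just an illustrative name:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Stand-in for repr(group) from the question: UTF-8 bytes, not ASCII-safe.
raw_repr = b'<Group: sch\xc3\xb6n>'

failure = None
try:
    # Python 2 performs the equivalent of this decode implicitly when the
    # bytestring is interpolated into a unicode format string.
    raw_repr.decode('ascii')
except UnicodeDecodeError as exc:
    failure = exc

# The codec and byte offset match the traceback in the question.
print(failure.encoding, failure.start)
```

The offset reported here is the same "position 11" that appears in the traceback: it is the index of the first byte of the UTF-8 encoded ö.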

    In Python 2, you should avoid mixing Unicode and byte strings. Always explicitly normalise your data (encoding Unicode to bytes or decoding bytes to Unicode).

    If you must use from __future__ import unicode_literals, you can still create bytestrings by using a b prefix:

    >>> from __future__ import unicode_literals
    >>> type('')   # empty unicode string
    <type 'unicode'>
    >>> type(b'')  # empty bytestring, note the b prefix
    <type 'str'>
    >>> b'%s' % b'<Group: sch\xc3\xb6n>'  # two bytestrings
    '<Group: sch\xc3\xb6n>'