
Python3 utf8 codecs not decoding as expected in Docker ubuntu:trusty


The following really bugs me: the version of Python on my laptop and the version inside Docker's ubuntu:trusty image print different results for the same codecs call. What is the reason for that? For example, Python 3 on my laptop (Ubuntu trusty):

Python 3.4.3 (default, Apr 14 2015, 14:16:55) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> codecs.decode(b'\xe2\x80\x99','utf8')
'’'
>>> 

python3 on Docker ubuntu:latest:

Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> codecs.decode(b'\xe2\x80\x99','utf8')
'\u2019'
>>> 

Can I make the Python 3 codecs in Docker's ubuntu:trusty image decode b'\xe2\x80\x99' as '’'?


Solution

  • This sounds like a locale configuration issue. Python could be behaving differently in the two locations because the terminal sessions it's running in are configured differently.

    Check the locale settings inside your Ubuntu Docker container to confirm that your terminal session is in a UTF-8 locale. In particular, see whether LC_CTYPE has been switched to C. (I've seen that on servers before, though I don't know why it happens.) That can determine whether the Python console treats the character as displayable, and so whether it shows the character itself or an escape sequence. Note that the decoded string is identical in both cases; only the way the console displays it differs. This would affect other terminal programs, too.
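
To make that distinction concrete, here is a minimal Python sketch of what the interactive prompt appears to do when it echoes a value. `console_echo` is a hypothetical helper, not part of Python; it assumes the prompt writes `repr(value)` and falls back to backslash escapes for any character the output encoding cannot represent.

```python
import sys

# The decode step gives the same one-character string in every locale.
decoded = b'\xe2\x80\x99'.decode('utf8')
print(decoded == '\u2019')  # True

def console_echo(value, encoding):
    """Approximate the prompt's echo of `value` when stdout uses `encoding`.

    Hypothetical helper for illustration only.
    """
    text = repr(value)
    # Characters outside the encoding become backslash escapes.
    return text.encode(encoding, 'backslashreplace').decode(encoding)

# With an ASCII stdout (what LC_CTYPE=C typically gives on Python 3.4),
# the curly quote is shown as an escape sequence.
print(console_echo(decoded, 'ascii'))  # '\u2019'

# With a UTF-8 stdout the character survives unchanged. The comparison is
# printed instead of the character so this script runs on any terminal.
print(console_echo(decoded, 'utf-8') == "'\u2019'")  # True

# The encoding your own session is actually using:
print(sys.stdout.encoding)
```

If the last line reports an ASCII encoding (for example `ANSI_X3.4-1968`) inside the container, the locale is the likely culprit rather than the codec.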

    I was able to reproduce this behavior in Python 3.4.0 on OS X by fiddling with the locale settings.

    [@ in ~]
    $ locale
    LANG="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_CTYPE="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_ALL=
    [@ in ~]
    $ python3.4
    Python 3.4.0 (v3.4.0:04f714765c13, Mar 15 2014, 23:02:41)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import codecs
    >>> codecs.decode(b'\xe2\x80\x99','utf8')
    '’'
    >>> quit()
    [@ in ~]
    $ LC_CTYPE=C python3.4
    Python 3.4.0 (v3.4.0:04f714765c13, Mar 15 2014, 23:02:41)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import codecs
    >>> codecs.decode(b'\xe2\x80\x99','utf8')
    '\u2019'
    >>> quit()
    

    If your locale settings are the cause, you need to do one of two things: set up the rc files in the Docker Ubuntu instance to select the appropriate UTF-8 locale, or get your local locale settings to propagate through SSH (or whatever connection method you're using) to the remote terminal session. Propagating the locale may make more sense, because it can also fix the problem on other servers or accounts you connect to.
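
After changing the rc files or the connection settings, a quick check from Python can confirm which locale variable is actually in effect. The sketch below is only an illustration: `effective_ctype` and `is_utf8_locale` are hypothetical helpers, and the codeset check is a simplification that assumes locale names of the form `language_TERRITORY.codeset` (it would not recognize a bare `UTF-8` value, which some systems accept for LC_CTYPE).

```python
import os

def effective_ctype(env):
    """Return the locale name that governs character handling.

    POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    Empty values count as unset; with nothing set the locale is "C".
    """
    for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = env.get(name)
        if value:
            return value
    return 'C'

def is_utf8_locale(locale_name):
    """True if the name carries a UTF-8 codeset, e.g. en_US.UTF-8."""
    if '.' not in locale_name:
        return False
    # The codeset sits after the '.', before any '@modifier'.
    codeset = locale_name.split('.', 1)[1].split('@', 1)[0]
    return codeset.lower().replace('-', '') == 'utf8'

# The environment from the first transcript above: LC_ALL is empty,
# so LC_CTYPE decides.
example = {'LANG': 'en_US.UTF-8', 'LC_CTYPE': 'en_US.UTF-8', 'LC_ALL': ''}
print(effective_ctype(example))                  # en_US.UTF-8
print(is_utf8_locale(effective_ctype(example)))  # True

# The second transcript: LC_CTYPE=C overrides LANG.
overridden = dict(example, LC_CTYPE='C')
print(effective_ctype(overridden))                  # C
print(is_utf8_locale(effective_ctype(overridden)))  # False

# The session this script is running in:
current = effective_ctype(os.environ)
print(current, is_utf8_locale(current))
```

Run inside the container, a result of `C False` on the last line would suggest that no locale variables reached the session at all.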