
python sl4a unicode (Android)


Does anyone have any experience with this?

I have been using Python 3.2 for the last six months, and my memory of 2.6.2 is not that great.

On my computer the following code works, tested using 2.6.1:

import contextlib
import codecs

def readfile(path):
    with contextlib.closing( codecs.open( path, 'r', 'utf-8' )) as f:
        for line in f:
            yield line

path = '/path/to/norsk/verbs.txt'

for i in readfile(path):
    print i

but on the phone it gets to the first special character ø and throws:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 3: ordinal not in range(128)

Any ideas? I am going to need to input these characters as well as read them from a file.


Solution

  • Printing is an I/O operation. I/O requires bytes. What you have in i is unicode, or characters. Characters only convert directly to bytes when we're talking about ascii, but on your phone you have encountered a non-ascii character (u'\xf8' is ø). To convert characters to bytes, you need to encode them.

    import contextlib
    import codecs
    
    def readfile(path):
        with contextlib.closing( codecs.open( path, 'r', 'utf-8' )) as f:
            for line in f:
                yield line
    
    path = '/path/to/norsk/verbs.txt'
    
    for i in readfile(path):
        print i.encode('utf8')
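To make the characters-versus-bytes distinction concrete, here is a minimal sketch that encodes one string two ways. The sample word `kj\xf8re` and the variable names are illustrative assumptions, not taken from the question's file. It is written to run on Python 2.6+ as well as Python 3.

```python
from __future__ import print_function

# Hypothetical sample word; u'\xf8' is the character ø from the traceback.
word = u'kj\xf8re'

# UTF-8 can represent ø, so encoding to bytes succeeds.
utf8_bytes = word.encode('utf-8')

# ASCII cannot represent ø, so this is the same failure print ran into.
try:
    word.encode('ascii')
    ascii_failed = False
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_failed = True
    ascii_error = exc  # keep a reference; Python 3 clears `exc` after the block

print(repr(utf8_bytes))
print('ascii failed:', ascii_failed)
```

The exception object records where the offending character sits (`ascii_error.start`), which is the "position" number reported in the error message.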
    

    As to why your code works on one machine and not the other, I bet python's autodetection has found different things in those cases. Run this on each device:

    $ python
    >>> import sys
    >>> sys.stdout.encoding
    'UTF-8'
    

    I expect you'll see utf8 on one and ascii (or None) on the other. This is the encoding print uses when the destination is a terminal; when it is None, Python 2 falls back to the default encoding, which is ascii. If you're sure that all users of your python installation (very possibly just you) prefer utf8 over ascii, you can change the default encoding of your python installation.

    1. Find your site.py: python -c 'import site; print site'
    2. Open it and find the setencoding function:

      def setencoding(): 
          """Set the string encoding used by the Unicode implementation.  The 
          default is 'ascii', but if you're willing to experiment, you can 
          change this.""" 
          encoding = "ascii" # Default value set by _PyUnicode_Init() 
      
    3. Change the encoding = "ascii" line to encoding = "UTF-8"

    Enjoy as things Just Work. You can find more information on this topic here: http://blog.ianbicking.org/illusive-setdefaultencoding.html
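If you would rather not edit site.py, or want to check what the interpreter reports before and after the change, something like the following may help. It is only a sketch: `pick_output_encoding` and `encode_for` are hypothetical helper names, and the `'utf-8'` fallback is an assumption about what your terminal can display.

```python
from __future__ import print_function
import sys


def pick_output_encoding(stream, fallback='utf-8'):
    """Return the stream's declared encoding, or `fallback` if it has none."""
    return getattr(stream, 'encoding', None) or fallback


def encode_for(stream, text, fallback='utf-8'):
    """Encode `text` to bytes using the encoding chosen for `stream`."""
    return text.encode(pick_output_encoding(stream, fallback))


# Report the encodings the interpreter knows about on this device.
print('stdout encoding:    ', sys.stdout.encoding)
print('default encoding:   ', sys.getdefaultencoding())
print('filesystem encoding:', sys.getfilesystemencoding())
```

On Python 2 you could then write `print encode_for(sys.stdout, i)` in the loop. Note that if the stream declares ascii, encoding ø will still fail; the fallback only applies when no encoding is declared at all.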

    If you'd instead like a strict separation of bytes vs characters such as python3 provides, you can set encoding = "undefined". The undefined codec will "Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired."
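A quick way to see what the undefined codec does, without touching site.py, is to ask for it explicitly. This is a small sketch with illustrative variable names; it only shows the codec's behaviour on an explicit call, not the effect of making it the default.

```python
from __future__ import print_function

# Even plain ASCII text is refused: the codec rejects every conversion.
try:
    u'abc'.encode('undefined')
    encode_raised = False
except UnicodeError:
    encode_raised = True

# Decoding bytes is refused in the same way.
try:
    b'abc'.decode('undefined')
    decode_raised = False
except UnicodeError:
    decode_raised = True

print('encode raised:', encode_raised)
print('decode raised:', decode_raised)
```

With this codec as the default, any place where your code silently mixes byte strings and unicode strings would surface as an exception instead of working by accident for ASCII-only data.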