Does anyone have any experience with this?
I have been using Python 3.2 for the last half year, and my memory of 2.6.2 is not that great.
On my computer the following code works, tested using 2.6.1:
import contextlib
import codecs

def readfile(path):
    with contextlib.closing( codecs.open( path, 'r', 'utf-8' )) as f:
        for line in f:
            yield line

path = '/path/to/norsk/verbs.txt'
for i in readfile(path):
    print i
but on the phone it gets to the first special character, ø, and throws:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 3: ordinal not in range(128)
Any ideas? I am going to need to input these characters as well as read them from a file.
Printing is an I/O operation, and I/O requires bytes. What you have in i is unicode, that is, characters. Characters convert directly to bytes only when they are all ASCII, but on your phone you have encountered a non-ASCII character (u'\xf8' is ø). To convert characters to bytes, you need to encode them:
import contextlib
import codecs

def readfile(path):
    with contextlib.closing( codecs.open( path, 'r', 'utf-8' )) as f:
        for line in f:
            yield line

path = '/path/to/norsk/verbs.txt'
for i in readfile(path):
    print i.encode('utf8')
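To see the encode step in isolation, here is a minimal sketch. The sample word is a hypothetical stand-in for a line from your file, not taken from it. It shows that the ASCII codec refuses ø while UTF-8 turns it into two bytes, and that decode goes the other way, which is the direction you would need for input that arrives as bytes.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; \xf8 is the letter ø.
word = u'kj\xf8pe'

# Encoding to UTF-8 succeeds: ø becomes the two bytes 0xC3 0xB8.
utf8_bytes = word.encode('utf-8')

# Encoding to ASCII fails, because ø has no ASCII representation.
try:
    word.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding reverses the process: bytes in, characters out.
round_trip = utf8_bytes.decode('utf-8')
```

The same rule applies in both directions: decode bytes as soon as they come in, encode characters just before they go out.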
As to why your code works on one machine and not the other, I bet Python's autodetection has found different things in those cases. Run this on each device:
$ python
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
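If you want to compare more than one setting at a time, the check above could be widened along these lines. This is only a sketch: report_encodings is a hypothetical helper name, and it assumes that sys.stdout.encoding (what print consults for a terminal, as far as I know) and sys.getdefaultencoding() are the other two values worth comparing between the devices.

```python
import sys

def report_encodings():
    """Collect the encodings Python picked up from the environment."""
    return {
        'filesystem': sys.getfilesystemencoding(),
        'default': sys.getdefaultencoding(),
        # May be None when stdout is a pipe or a file rather than a terminal.
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

encodings = report_encodings()
for name in sorted(encodings):
    sys.stdout.write('%s: %s\n' % (name, encodings[name]))
```

Run it on both the computer and the phone and compare the three lines.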
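I expect you'll see UTF-8 on one and ASCII on the other. Strictly speaking, print encodes with sys.stdout.encoding when the destination is a terminal and falls back to the default encoding when that is not set, but on Unix-like systems the terminal value normally comes from the same locale settings as the filesystem encoding, so the two usually agree. If you're sure that all users of your Python installation (very possibly just you) prefer UTF-8 over ASCII, you can change the default encoding of your Python installation.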
First, find your site.py:
python -c 'import site; print site'
Open that file and find the setencoding function:
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
Change the encoding = "ascii" line to encoding = "UTF-8".
Enjoy as things Just Work. You can find more information on this topic here: http://blog.ianbicking.org/illusive-setdefaultencoding.html
If you'd instead like a strict separation of bytes vs. characters, such as Python 3 provides, you can set encoding = "undefined". The undefined codec will "Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired."
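To get a feel for what the undefined codec does before committing to it, you can call it explicitly. This is a small sketch that names the codec by hand rather than installing it as the default, and encode_strictly is a hypothetical helper used only for illustration.

```python
def encode_strictly(text):
    """Try to encode with the 'undefined' codec, which rejects everything."""
    try:
        return text.encode('undefined')
    except UnicodeError as exc:
        # The codec raises for every conversion, even pure-ASCII text.
        return 'refused: %s' % exc

result = encode_strictly(u'abc')
```

With this codec as the default, any place your code mixes bytes and characters without an explicit encode or decode would fail in the same way, which is the point of choosing it.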