python python-2.7 unicode nltk python-unicode

Unicode object to a list

I have a UTF-8 text corpus that I can read easily in Python 2.7:

import codecs
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()

> This is my sentence in the right format

However, when I turn this text into a list (for example, by tokenizing it):

tokens = sentence.tokenize()

and print it in the notebook, I get strange byte-like characters, such as:

(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")

I would like normal characters instead, just as in my original import.

So my question is: how can I put unicode objects into a list without getting strange byte/ASCII characters?

Solution

It's all in how you print. When Python 2 displays a list or tuple, it shows the repr() of each item, which uses ASCII-only characters and substitutes backslash escape codes for non-ASCII characters. This makes it easy to see hidden characters that normal printing would leave invisible, like the doubled byte-order mark (BOM) \ufeff at the start of your first string. The strings themselves are unchanged, and printing individual items displays them correctly.
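Since the escaped display reveals two BOM characters at the start of the first token, here is a minimal sketch of removing them. The helper name strip_bom is hypothetical, and the snippet uses print as a function so that it runs on both Python 2.7 and Python 3. It assumes the BOMs only appear at the start of the text, as in the question.

```python
from __future__ import print_function

BOM = u'\ufeff'  # the byte-order mark as a unicode code point

def strip_bom(text):
    """Hypothetical helper: drop every leading BOM from a unicode string."""
    return text.lstrip(BOM)

tokens = (u'\ufeff\ufeffFaux,', u'Tunisie')
clean = tuple(strip_bom(tok) for tok in tokens)

# The BOMs are invisible when printed, so compare lengths instead.
print(len(tokens[0]), len(clean[0]))  # 7 5
```

Applying the same lstrip to the whole text right after reading the file would also work, and it removes both BOMs where a codec that strips a single BOM would leave one behind.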

Many examples

Original strings:

>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")

Displaying at the interactive prompt:

>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")

Printing individual strings from the tuples:

>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,

A way to print tuples without escape codes:

>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
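The join trick above can be wrapped in a small function for reuse. This is only a sketch: format_items is a hypothetical name, the result is meant for display only (quotes inside an item are not escaped, as the l'Égypte example shows), and print is used as a function so the snippet runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function

def format_items(items):
    """Hypothetical helper: build a tuple-like display string
    from unicode items without the escapes that repr() adds."""
    quoted = (u"'%s'" % item for item in items)
    return u"(" + u", ".join(quoted) + u")"

t = (u'Tunisie', u"l'\xc9gypte,")
line = format_items(t)  # a unicode string, not a tuple
print(line)
```

On Python 2, the final print still depends on the console encoding being able to represent the characters.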