Unicode object to a list


I have a UTF-8 text corpus that I can read easily in Python 2.7:

import codecs

sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()

> This is my sentence in the right format

However, when I turn this text into a list (for example, by tokenizing it):

tokens = sentence.tokenize()

and print it in the notebook, I get escaped, byte-like characters such as:

(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")

I would like normal characters instead, just as in my original import.

So my question is: how can I put unicode objects into a list without getting strange bit/ASCII characters?


Solution

  • It's all in how you print. When Python 2 displays a list or tuple, it shows the repr() of each item, which uses only ASCII characters and substitutes backslash escape codes for anything non-ASCII. This makes it easy to see hidden characters that normal printing would leave invisible, like the doubled byte-order mark (BOM) \ufeff at the start of your first string. The strings themselves are intact, and printing individual items displays them correctly.

    Many examples

    Original strings:

    >>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
    >>> t = (u'Tunisie', u"l'\xc9gypte,")
    

    Displaying the tuples at the interactive prompt, or printing them whole:

    >>> s
    (u'\ufeff\ufeffFaux,', u'Tunisie')
    >>> t
    (u'Tunisie', u"l'\xc9gypte,")
    >>> print s
    (u'\ufeff\ufeffFaux,', u'Tunisie')
    >>> print t
    (u'Tunisie', u"l'\xc9gypte,")
    

    Printing individual strings from the tuples:

    >>> print s[0]
    Faux,
    >>> print s[1]
    Tunisie
    >>> print t[0]
    Tunisie
    >>> print t[1]
    l'Égypte,
    >>> print ' '.join(s)
    Faux, Tunisie
    >>> print ' '.join(t)
    Tunisie l'Égypte,
    

    A way to print tuples without escape codes:

    >>> print "('"+"', '".join(s)+"')"
    ('Faux,', 'Tunisie')
    >>> print "('"+"', '".join(t)+"')"
    ('Tunisie', 'l'Égypte,')
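The join trick above can be wrapped in a small helper, and the stray BOMs can be removed at the same time. This is only a sketch: `strip_bom` and `format_items` are hypothetical names introduced here, not part of Python or NLTK, and the code is written so that it should run unchanged on Python 2.7 and Python 3.

```python
from __future__ import print_function

BOM = u'\ufeff'  # the byte-order mark that shows up escaped in the tuple display


def strip_bom(text):
    """Remove any leading byte-order marks from a unicode string."""
    return text.lstrip(BOM)


def format_items(items):
    """Build a tuple-like display string from the strings themselves,
    so no backslash escapes are introduced (hypothetical helper)."""
    return u"(" + u", ".join(u"'" + item + u"'" for item in items) + u")"


s = (u'\ufeff\ufeffFaux,', u'Tunisie')
t = (u'Tunisie', u"l'\xc9gypte,")

# Drop the invisible BOMs before displaying or processing further.
clean_s = tuple(strip_bom(item) for item in s)

print(format_items(clean_s))
print(format_items(t))
```

On a console or notebook whose output encoding can represent the characters, this prints ('Faux,', 'Tunisie') and ('Tunisie', 'l'Égypte,'). The result is meant for reading only; as the second line shows, embedded quotes are not escaped, so it is not valid Python syntax the way repr() output is.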