Tags: string, python-2.7, python-unicode, representation

How to return the str representation of non-ASCII letters in Python


I have a code snippet that can separate Portuguese texts from numbers. The code is:

import re
def name():
    text = u'Obras de revisão e recuperação (45453000-7)'
    splits = text.split(u" (")
    return(str(splits[0].encode("utf8")))
name()

and the output is: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o'

but when I write

print(splits[0].encode("utf8"))

the output is: Obras de revisão e recuperação, which is my desired result.

But it doesn't work when the value is returned from the function. I have read about the difference between __str__ and __repr__, but I still don't see how to get the same output as __str__ from a return inside a function.


Solution

  • You are overthinking this. Because you create the text with a unicode literal, it is a unicode object, and the splits list will contain unicode objects too:

    In [4]: def name():
       ...:     text = u'Obras de revisão e recuperação (45453000-7)'
       ...:     splits = text.split(u" (")
       ...:     return splits
       ...:
    
    In [5]: splits = name()
    
    In [6]: splits
    Out[6]: [u'Obras de revis\xe3o e recupera\xe7\xe3o', u'45453000-7)']
    

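    When a list is displayed, the __repr__ of each object it contains is used, which is why you see the escape sequences. The same applies to any value the interactive interpreter echoes, such as the return value of name(). If you want the __str__ form, just use print: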

    In [7]: for piece in splits:
       ...:     print(piece)
       ...:
    Obras de revisão e recuperação
    45453000-7)
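
    To make the container behaviour visible without any non-ASCII text, here is a minimal sketch. Label is a hypothetical class (not from the question) whose __str__ and __repr__ return deliberately different text:

```python
class Label(object):
    """Hypothetical class, used only to make __str__ and __repr__ visibly different."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Used by print() and str() on the object itself.
        return self.text

    def __repr__(self):
        # Used by the interactive echo and by containers such as lists.
        return "Label(%r)" % (self.text,)


labels = [Label("a"), Label("b")]

as_list = str(labels)      # the list formats each item with repr()
as_item = str(labels[0])   # the item on its own uses __str__

print(as_list)
print(as_item)
```

    The list shows the Label(...) form for every item, while the single item prints as its plain text.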
    

    Note that .encode returns a byte string, i.e. a regular, non-unicode Python 2 str. Calling str on it is essentially the identity function, because the result of encoding is already a str:

    In [8]: splits[0].encode('utf8')
    Out[8]: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o'
    
    In [9]: str(splits[0].encode('utf8'))
    Out[9]: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o'
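
    The \xc3\xa3 pairs in that output are the individual UTF-8 bytes of each accented letter. A small sketch of this, using one word from the example; the variable names are mine, the accented letter is written as an escape so the source file encoding does not matter, and it should behave the same under Python 2.7 and Python 3:

```python
text = u"revis\xe3o"            # 7 characters; \xe3 is the code point for a-tilde
encoded = text.encode("utf8")   # byte string: a-tilde becomes the two bytes \xc3\xa3

print("%d characters, %d bytes" % (len(text), len(encoded)))

# Decoding the bytes with the same codec gives back the original text.
decoded = encoded.decode("utf8")
```

    This is why it is usually simpler to return the unicode object and encode only at the point where bytes are actually required.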
    

    You should really, really consider using Python 3, which streamlines this. str in Python 3 corresponds to Python 2 unicode, and Python 2 str corresponds to Python 3 bytes objects.
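
    As a rough illustration of that correspondence, the sketch below is meant for Python 3 only. first_part is a hypothetical stand-in for the name function from the question, and the accented letters are written as escapes purely to keep the snippet ASCII-safe:

```python
def first_part(text):
    """Hypothetical helper: return everything before the ' (' separator."""
    return text.split(" (")[0]


label = first_part("Obras de revis\u00e3o e recupera\u00e7\u00e3o (45453000-7)")
encoded = label.encode("utf-8")

print(type(label).__name__)    # str: text, like Python 2 unicode
print(type(encoded).__name__)  # bytes: raw bytes, like Python 2 str

# Python 3 refuses to mix the two types implicitly.
try:
    label + encoded
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

    In Python 3 the interactive echo of label also shows the accented letters, since repr() of a str keeps printable non-ASCII characters.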

    So, to clarify things, your name function should work like this:

    In [16]: def name():
        ...:     text = u'Obras de revisão e recuperação (45453000-7)'
        ...:     splits = text.split(u" (")
        ...:     return splits[0]
        ...:
    
    In [17]: print(name())
    Obras de revisão e recuperação