string python-2.7 python-unicode representation

How to return the str representation of non-ASCII letters in Python

I have a code snippet that can separate Portuguese texts from numbers. The code is:

import re
def name():
    text = u'Obras de revisão e recuperação (45453000-7)'
    splits = text.split(u" (")
    return(str(splits[0].encode("utf8")))
name()

and the output is: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o'

but when I write

print(splits[0].encode("utf8"))

the output is: Obras de revisão e recuperação, which is my desired result.

But it doesn't work when I return the value from the function. I have read about the difference between __str__ and __repr__, but I am still clueless about how to get the same output as __str__ when returning from a function.

Solution

You are overthinking this. You use a unicode literal to make your unicode object, and then your splits list will contain unicode objects:

In [4]: def name():
   ...:     text = u'Obras de revisão e recuperação (45453000-7)'
   ...:     splits = text.split(u" (")
   ...:     return splits
   ...:

In [5]: splits = name()

In [6]: splits
Out[6]: [u'Obras de revis\xe3o e recupera\xe7\xe3o', u'45453000-7)']

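When a list is displayed, the __repr__ of each object it contains is used, which is why you see escape sequences such as \xe3. The interactive prompt likewise shows the __repr__ of whatever an expression returns. If you want the __str__ of an element, just print the element itself: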

In [7]: for piece in splits:
   ...:     print(piece)
   ...:
Obras de revisão e recuperação
45453000-7)
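
The same rule applies to any object, not only unicode strings. A minimal sketch, using a hypothetical Label class that is not part of the question's code, to show that a container always falls back to the repr() of its elements while str() or print on a single element uses __str__:

```python
class Label(object):
    """Hypothetical class, used only to make repr and str visibly different."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # Unambiguous, developer-facing form.
        return "Label(%r)" % (self.text,)

    def __str__(self):
        # Human-readable form, the one print uses.
        return self.text


items = [Label("a"), Label("b")]

# A list has no __str__ of its own, so str(list) uses repr() on each element.
list_display = str(items)

# Converting one element directly uses its __str__.
element_display = str(items[0])
```

Here list_display holds the repr-based text and element_display holds the plain text, which mirrors the difference between Out[6] and the printed loop output above.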

Note that .encode returns a byte string, i.e. a regular, non-unicode Python 2 str. Calling str on it is essentially the identity function; it is already a str once you encode it:

In [8]: splits[0].encode('utf8')
Out[8]: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o'

In [9]: str(splits[0].encode('utf8'))
Out[9]: 'Obras de revis\xc3\xa3o e recupera\xc3\xa7\xc3\xa3o'
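
To make the text/bytes distinction concrete, here is a small sketch of the encode/decode round trip. It is written with escape sequences so that it does not depend on the source file encoding, and it is meant to behave the same on Python 2.7 and Python 3; the variable names are illustrative only.

```python
# Text (unicode in Python 2, str in Python 3): one item per character.
text = u'Obras de revis\xe3o e recupera\xe7\xe3o'

# Encoding produces bytes (str in Python 2, bytes in Python 3).
encoded = text.encode('utf-8')

# Each of the three accented characters takes two bytes in UTF-8,
# so the byte string is three items longer than the text.
extra_bytes = len(encoded) - len(text)

# Decoding with the same codec recovers the original text.
decoded = encoded.decode('utf-8')
```

The \xc3\xa3 pairs in Out[8] are exactly those two-byte UTF-8 sequences, shown escaped because the repr of a byte string is being displayed.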

You should really, really consider using Python 3, which streamlines this. str in Python 3 corresponds to Python 2 unicode, and Python 2 str corresponds to Python 3 bytes objects.

So, to clarify things, your name function should simply return the unicode object, and the caller should print it:

In [16]: def name():
    ...:     text = u'Obras de revisão e recuperação (45453000-7)'
    ...:     splits = text.split(u" (")
    ...:     return splits[0]
    ...:

In [17]: print(name())
Obras de revisão e recuperação
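
If you do eventually need bytes, for example to write to a file or a socket, one common approach is to keep unicode inside the program and encode only at that boundary. A hedged sketch of the idea, using io.BytesIO as a stand-in for a real byte-oriented destination (the sink variable is an assumption for illustration, not part of the original code):

```python
import io


def name():
    # Same logic as above; escapes are used so the file encoding does not matter.
    text = u'Obras de revis\xe3o e recupera\xe7\xe3o (45453000-7)'
    splits = text.split(u" (")
    return splits[0]  # return text, not encoded bytes


label = name()

# Stand-in for a file or socket opened in binary mode.
sink = io.BytesIO()

# Encode only at the point where bytes are actually required.
sink.write(label.encode('utf-8'))
raw = sink.getvalue()
```

Inside the program label stays text, so printing it gives the readable form, while raw holds the UTF-8 bytes that were written out.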