Let's say I have a unicode variable:
uni_var = u'Na teatr w pi\xc4\x85tek'
I want to have a string that is the same as uni_var, just without the "u", so:
str_var = 'Na teatr w pi\xc4\x85tek'
How can I do it? I would like to find something like:
str_var = uni_var.text()
You appear to have badly decoded Unicode; those are UTF-8 bytes masquerading as Latin-1 codepoints.
You can get back to the proper UTF-8 bytes by encoding with a codec that maps Unicode codepoints one-to-one to bytes, like Latin-1:
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> uni_var.encode('latin1')
'Na teatr w pi\xc4\x85tek'
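But be careful: the text could also have been decoded with CP1252 rather than Latin-1. CP1252 maps most bytes in the 0x80-0x9F range to different codepoints than Latin-1 does, so in that case you would have to encode with 'cp1252' instead. It all depends on where this Mojibake was produced.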
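If you don't know which of the two codecs produced the Mojibake, one option is to try both. The following is only a minimal sketch under that assumption; `repair_mojibake` is a hypothetical helper name, not part of any library. It tries Latin-1 first, falls back to CP1252, and leaves the text unchanged if neither round trip yields valid UTF-8. It is written to run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-

def repair_mojibake(text, codecs=('latin-1', 'cp1252')):
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded.

    Tries each candidate codec in order and returns the first result
    that encodes cleanly and then decodes as valid UTF-8. If none
    works, the input is returned unchanged.
    """
    for codec in codecs:
        try:
            # Recover the original bytes, then decode them properly.
            return text.encode(codec).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Wrong guess for this codec; try the next one.
            continue
    return text


latin1_damaged = u'Na teatr w pi\xc4\x85tek'     # decoded as Latin-1
cp1252_damaged = u'Na teatr w pi\xc4\u2026tek'   # decoded as CP1252

fixed_from_latin1 = repair_mojibake(latin1_damaged)
fixed_from_cp1252 = repair_mojibake(cp1252_damaged)
```

Both calls give the Unicode string u'Na teatr w pi\u0105tek'. If you need the byte string from the question rather than Unicode, call `.encode('utf-8')` on the result. Note that this guesses blindly: text that merely happens to survive the round trip would be altered as well, so only apply it to data you know is damaged.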
You could also use the ftfy library to detect how best to repair this; it produces Unicode output:
>>> import ftfy
>>> uni_var = u'Na teatr w pi\xc4\x85tek'
>>> ftfy.fix_text(uni_var)
u'Na teatr w pi\u0105tek'
>>> print ftfy.fix_text(uni_var)
Na teatr w piątek
The library handles CP1252 Mojibake automatically.