python unicode python-unicode mojibake

python unicode get value / get text


Let's say I have a unicode variable:

uni_var = u'Na teatr w pi\xc4\x85tek'

I want a plain string with the same contents as uni_var, just without the "u" prefix, so:

str_var = 'Na teatr w pi\xc4\x85tek'

How can I do it? I would like to find something like:

str_var = uni_var.text()

Solution

  • You appear to have badly decoded Unicode; those are UTF-8 bytes masquerading as Latin-1 codepoints.

    You can get back to proper UTF-8 bytes by encoding with a codec that maps Unicode codepoints one-to-one to bytes, like Latin-1:

    >>> uni_var = u'Na teatr w pi\xc4\x85tek'
    >>> uni_var.encode('latin1')
    'Na teatr w pi\xc4\x85tek'
    

    Be careful, though: in other cases the CP1252 encoding may have been used to decode to Unicode instead, and CP1252 maps the bytes 0x80-0x9F to different codepoints than Latin-1 does. It all depends on where this Mojibake was produced.

    You could also use the ftfy library to detect how to best repair this; it produces Unicode output:

    >>> import ftfy
    >>> uni_var = u'Na teatr w pi\xc4\x85tek'
    >>> ftfy.fix_text(uni_var)
    u'Na teatr w pi\u0105tek'
    >>> print ftfy.fix_text(uni_var)
    Na teatr w piątek
    

    The library handles CP1252 Mojibake automatically.
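
The Latin-1 versus CP1252 caveat above can be illustrated with a small sketch. It is written for Python 3 (where `str` is Unicode text and `encode` returns `bytes`), not the Python 2 used in the answer. The helper name `repair_mojibake` and the order in which codecs are tried are assumptions made for illustration, not part of the original answer or of any library:

```python
def repair_mojibake(text, codecs=("latin-1", "cp1252")):
    """Try to undo UTF-8 text that was wrongly decoded with a single-byte codec.

    Hypothetical helper: each candidate codec is tried in order, and the
    input is returned unchanged if none of them yields valid UTF-8.
    """
    for codec in codecs:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# Byte 0x85 wrongly decoded as Latin-1 becomes the control character U+0085 ...
via_latin1 = "Na teatr w pi\xc4\x85tek"
# ... but wrongly decoded as CP1252 it becomes U+2026 (horizontal ellipsis).
via_cp1252 = "Na teatr w pi\xc4\u2026tek"

fixed_a = repair_mojibake(via_latin1)
fixed_b = repair_mojibake(via_cp1252)

# Both inputs repair to the same text, with U+0105 (a with ogonek).
print(fixed_a == fixed_b == "Na teatr w pi\u0105tek")
```

The second input cannot be encoded as Latin-1 at all, so the helper falls through to CP1252. Text that was never damaged, such as a correctly decoded "piątek", fails both attempts and is returned as-is. A blind round trip like this can still misfire on unusual input, which is where a library such as ftfy is the more careful choice.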