Tags: python, utf-8, python-2.x, mojibake, pyquery

Convert unicode with utf-8 string as content to str


I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

but what I get in content is a unicode string whose code points are actually the UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

to make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'


Solution

  • If you have a unicode value whose code points are really UTF-8 bytes, encode it to Latin-1 to recover those bytes:

    content = content.encode('latin1')
    

    This works because the Unicode code points U+0000 to U+00FF map one-to-one onto the Latin-1 byte values 0x00 to 0xFF, so encoding to Latin-1 turns each code point back into the literal byte it stands for.

    For your example this gives me:

    >>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
    >>> content.encode('latin1')
    '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
    >>> content.encode('latin1').decode('utf8')
    u'\u5c42\u53e0\u6837\u5f0f\u8868'
    >>> print content.encode('latin1').decode('utf8')
    层叠样式表
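
If the same repair is needed in more than one place, it could be wrapped in a small helper. The following is only a minimal sketch, not part of PyQuery: the name `repair_mojibake` is hypothetical, and it assumes the input is text whose code points are really UTF-8 bytes. When that assumption does not hold, it returns the input unchanged. The `u''` literals keep it runnable on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-


def repair_mojibake(text):
    """Re-interpret UTF-8 bytes that were wrongly decoded as Latin-1.

    `repair_mojibake` is a hypothetical helper name, not a PyQuery API.
    """
    try:
        # Latin-1 maps U+0000..U+00FF straight back to bytes 0x00..0xFF.
        raw = text.encode('latin-1')
        # Decode those bytes with the encoding they were written in.
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above U+00FF (so it was not
        # mis-decoded as Latin-1) or the bytes are not valid UTF-8.
        return text


broken = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = repair_mojibake(broken)

# The repaired text matches the decoded value shown above.
print(fixed == u'\u5c42\u53e0\u6837\u5f0f\u8868')
# Text that is already correct passes through untouched.
print(repair_mojibake(fixed) == fixed)
```

Falling back to the original text is a design choice for the sketch; raising the exception instead may be preferable when silently passing bad data through would hide a problem.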
    

    PyQuery uses either requests or urllib to retrieve the HTML and, in the case of requests, reads the .text attribute of the response. That attribute decodes the response body using only the encoding named in the Content-Type header; if the header names no charset, requests falls back to Latin-1 for text responses, and HTML is a text response. You can override this by passing in an encoding argument:

    dom = PyQuery('http://zh.wikipedia.org/w/index.php',
                  {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
                  encoding='utf8')
    

    at which point you no longer have to re-encode at all.
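
The decoding step that produces the garbled string can be reproduced without any network access. The sketch below is only an illustration: the byte string `body` is a stand-in for a response body (an assumption, no real HTTP request or PyQuery call is involved). It suggests why UTF-8 bytes decoded as Latin-1 give exactly the string from the question, while decoding them as UTF-8 gives the intended text.

```python
# -*- coding: utf-8 -*-
# Stand-in for the raw bytes a server would send: the UTF-8 encoding of
# five Chinese characters (hypothetical data, no HTTP involved).
body = u'\u5c42\u53e0\u6837\u5f0f\u8868'.encode('utf-8')

# Decoding with Latin-1, as happens when no charset is known:
# every byte becomes one code point in the range U+0000..U+00FF.
wrong = body.decode('latin-1')

# Decoding with the encoding the bytes were actually written in.
right = body.decode('utf-8')

# The mis-decoded text is the value from the question.
print(wrong == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8')
# Fifteen bytes turn into fifteen wrong characters, or five right ones.
print(len(wrong) == 15 and len(right) == 5)
# Latin-1 loses nothing, which is why the damage is reversible.
print(wrong.encode('latin-1') == body)
```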