
Fetching a URL and converting to UTF-8 in Python


I would like to do my first project in Python, but I have a problem with encoding. When I fetch data, it shows escaped bytes instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:

import urllib.request
import sys

page = urllib.request.urlopen("http://olx.pl/")
test = page.read()

print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))

I know the code here is poor, but I tried many options to change the encoding. I wrote z = "ł" to check whether it can print any 'special' letter, and it does. I tried to encode it, and that also works as it should. sys.stdin.encoding shows cp852.


Solution

  • The data you read from a urlopen() response is encoded data (bytes). You need to decode that data with the right encoding first.

    You appear to have downloaded UTF-8 data; you have to decode it before you have text:

    test = page.read().decode('utf8')
    

    However, it is up to the server to tell you which encoding was used for the data it sent. Check for a character set in the headers:

    encoding = page.info().get_content_charset()
    

    This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.

    You may not be able to print all of that text on your console; code page 852 can only represent 256 different characters, while the Unicode standard defines far more.
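
The steps above can be combined into one rough sketch: read the charset from the response headers, decode the bytes, and make the text safe for a limited console code page before printing. The helper names decode_body, console_safe and fetch_text are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption, not something the server guarantees.

```python
import urllib.request


def decode_body(raw, declared_charset=None, fallback="utf-8"):
    """Decode response bytes with the declared charset, else the fallback.

    The UTF-8 fallback is an assumption for pages that declare nothing.
    """
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong declaration: decode with the
        # fallback and substitute U+FFFD for bytes that do not fit.
        return raw.decode(fallback, errors="replace")


def console_safe(text, console_encoding):
    """Replace characters the console encoding cannot represent with '?'."""
    encoded = text.encode(console_encoding, errors="replace")
    return encoded.decode(console_encoding)


def fetch_text(url):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as page:
        # get_content_charset() returns None when no charset is declared.
        charset = page.info().get_content_charset()
        return decode_body(page.read(), charset)


# Example use (performs a network request, so it is left commented out):
# import sys
# text = fetch_text("http://olx.pl/")
# print(console_safe(text, sys.stdout.encoding or "ascii"))
```

With a cp852 console, letters such as 'ł' and 'ć' print as they are, while characters outside that code page come out as '?' instead of raising UnicodeEncodeError.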