I am trying to make an HTTP request in Python using urllib.request:
import urllib.request
url = 'https://www.example.com/pdf/123'
request = urllib.request.urlopen(url)
headers = request.getheaders()
When I print the headers, the output includes a Cyrillic filename that is decoded with the wrong encoding:
('Content-Disposition', 'attachment; filename="Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;')
It probably has something to do with a default encoding being applied because the HTTP response is a PDF file, but I may be wrong. I also tried downloading the file in a browser, and there the filename is displayed and saved correctly in Cyrillic, without mojibake: Про наук-досл семінар.pdf
So I guess "Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80" corresponds to "Про наук-досл семінар".
How can I make Python display the filename correctly in the HTTP response headers?
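Figured it out. http.client decodes header bytes as Latin-1 (ISO-8859-1), but the server sent the filename as raw UTF-8 bytes. Encoding the string returned from the headers as latin-1 (which restores the original bytes) and then decoding it as utf-8 worked for me.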
Input:
headers[6][1]
Output:
'attachment; filename="Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
Input:
headers[6][1].encode('latin1')
Output:
b'attachment; filename="\xd0\x9f\xd1\x80\xd0\xbe \xd0\xbd\xd0\xb0\xd1\x83\xd0\xba-\xd0\xb4\xd0\xbe\xd1\x81\xd0\xbb \xd1\x81\xd0\xb5\xd0\xbc\xd1\x96\xd0\xbd\xd0\xb0\xd1\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
Input:
headers[6][1].encode('latin1').decode('utf-8')
Output:
'attachment; filename="Про наук-досл семінар.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
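For reuse, the same round trip can be wrapped in a small helper. This is only a sketch: the names fix_header_value and filename_from_content_disposition are made up for illustration, and it assumes the server put raw UTF-8 bytes in the header, as it did here. If the value does not survive the round trip (for example, it really was Latin-1), the helper returns it unchanged. The regex only handles the simple quoted filename="..." form shown above.

```python
import re


def fix_header_value(value: str) -> str:
    """Undo the Latin-1 decoding applied to header bytes (hypothetical helper).

    Assumes the server sent UTF-8 bytes. Falls back to the original
    string if the round trip is not possible.
    """
    try:
        # latin-1 maps code points 0-255 back to the same byte values,
        # so this recovers the bytes the server actually sent.
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not UTF-8 mojibake (or already correct text): leave it alone.
        return value


def filename_from_content_disposition(value: str):
    """Return the quoted filename parameter, or None if it is absent."""
    match = re.search(r'filename="([^"]*)"', fix_header_value(value))
    return match.group(1) if match else None


# Simulate what the header looks like after Latin-1 decoding of UTF-8 bytes.
raw = 'attachment; filename="Про наук-досл семінар.pdf"; size=57814;'
garbled = raw.encode('utf-8').decode('latin-1')
print(filename_from_content_disposition(garbled))
```

With the response object from the question, looking the header up by name with request.getheader('Content-Disposition') is probably less brittle than headers[6][1], since the header's position in the list can change between responses.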