python encoding request http-headers urllib

Cannot resolve encoding of filename in HTTP response headers

I am trying to make an HTTP request in Python using urllib.request:

import urllib.request
url = 'https://www.example.com/pdf/123'
request = urllib.request.urlopen(url)
headers = request.getheaders()

When I print the headers, the output includes a filename that should be in Cyrillic but appears in the wrong encoding:

('Content-Disposition', 'attachment; filename="Ð\x9fÑ\x80Ð¾ Ð½Ð°Ñ\x83Ðº-Ð´Ð¾Ñ\x81Ð» Ñ\x81ÐµÐ¼Ñ\x96Ð½Ð°Ñ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;')

It probably has something to do with a binary encoding being applied by default because the HTTP response is a PDF file, but I may be wrong. I also downloaded the file via a browser, and there the filename is displayed and saved correctly in Cyrillic, without mojibake: Про наук-досл семінар.pdf.

So, I guess, the "Ð\x9fÑ\x80Ð¾ Ð½Ð°Ñ\x83Ðº-Ð´Ð¾Ñ\x81Ð» Ñ\x81ÐµÐ¼Ñ\x96Ð½Ð°Ñ\x80" corresponds to "Про наук-досл семінар".

How can I make Python display the filename correctly in the HTTP response headers?

Solution

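Figured it out. Encoding the returned header string as latin-1 and then decoding it as utf-8 worked for me. The server sends the filename as raw UTF-8 bytes, and Python's http.client decodes header bytes as ISO-8859-1 (latin-1), so encoding the string back to latin-1 recovers the original bytes, which can then be decoded as UTF-8.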

Input:

headers[6][1]

Output:

'attachment; filename="Ð\x9fÑ\x80Ð¾ Ð½Ð°Ñ\x83Ðº-Ð´Ð¾Ñ\x81Ð» Ñ\x81ÐµÐ¼Ñ\x96Ð½Ð°Ñ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'

Input:

headers[6][1].encode('latin1')

Output:

b'attachment; filename="\xd0\x9f\xd1\x80\xd0\xbe \xd0\xbd\xd0\xb0\xd1\x83\xd0\xba-\xd0\xb4\xd0\xbe\xd1\x81\xd0\xbb \xd1\x81\xd0\xb5\xd0\xbc\xd1\x96\xd0\xbd\xd0\xb0\xd1\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'

Input:

headers[6][1].encode('latin1').decode('utf-8')

Output:

'attachment; filename="Про наук-досл семінар.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
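The steps above rely on the header being at index 6, which is specific to this response. As a rough sketch, the same round trip can be wrapped in a small helper that looks the header up by name and falls back to the original value when the string is not actually mis-decoded UTF-8. The names fix_header_mojibake and get_header are hypothetical helpers made up for this illustration, and the headers list is simulated here so that the example runs without a network request:

```python
def fix_header_mojibake(value: str) -> str:
    """Undo UTF-8 bytes that were decoded as Latin-1 (hypothetical helper)."""
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the raw bytes the server sent; then decode them as UTF-8.
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value was not Latin-1-decoded UTF-8; leave it untouched.
        return value


def get_header(headers, name):
    """Return the first header value whose name matches, case-insensitively."""
    for key, value in headers:
        if key.lower() == name.lower():
            return value
    return None


# Simulated getheaders() result (assumption: the server sent raw UTF-8 bytes,
# which were then decoded as Latin-1, as in the question).
original = 'attachment; filename="Про наук-досл семінар.pdf"'
headers = [
    ("Content-Type", "application/pdf"),
    ("Content-Disposition", original.encode("utf-8").decode("latin-1")),
]

raw = get_header(headers, "content-disposition")
fixed = fix_header_mojibake(raw)
print(fixed)
```

With a real response, the list returned by getheaders() could be passed to get_header in place of the simulated one. The try/except is a design choice so that plain ASCII headers, or headers that already contain correctly decoded text, pass through unchanged.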