Tags: python, encoding, request, http-headers, urllib

Cannot resolve encoding of filename in HTTP response headers


I am trying to make an HTTP request in Python using urllib.request:

import urllib.request

url = 'https://www.example.com/pdf/123'
# urlopen() returns the response object, not a request
response = urllib.request.urlopen(url)
headers = response.getheaders()  # list of (name, value) tuples

When I print headers, the output includes a filename that should be in Cyrillic but appears in the wrong encoding:

('Content-Disposition', 'attachment; filename="Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;')

I suspect it has something to do with a binary encoding being used by default, since the HTTP response is a PDF file, but I could be wrong. I also tried downloading the file in a browser, and there the filename is displayed and saved correctly in Cyrillic, without mojibake: Про наук-досл семінар.pdf.

So I assume that "Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80" corresponds to "Про наук-досл семінар".

How can I make Python display the filename correctly in the HTTP response headers?


Solution

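  • Figured it out. http.client, which urllib.request uses underneath, decodes header bytes as ISO-8859-1 (latin-1), but this server sends the filename as UTF-8, so every UTF-8 byte shows up as a separate latin-1 character. Encoding the header string back to latin-1 and then decoding it as UTF-8 recovers the original text. (In this response, headers[6] happens to be the Content-Disposition header.)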

    Input:

    headers[6][1]
    

    Output:

    'attachment; filename="Ð\x9fÑ\x80о наÑ\x83к-доÑ\x81л Ñ\x81емÑ\x96наÑ\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
    

    Input:

    headers[6][1].encode('latin1')
    

    Output:

    b'attachment; filename="\xd0\x9f\xd1\x80\xd0\xbe \xd0\xbd\xd0\xb0\xd1\x83\xd0\xba-\xd0\xb4\xd0\xbe\xd1\x81\xd0\xbb \xd1\x81\xd0\xb5\xd0\xbc\xd1\x96\xd0\xbd\xd0\xb0\xd1\x80.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
    

    Input:

    headers[6][1].encode('latin1').decode('utf-8')
    

    Output:

    'attachment; filename="Про наук-досл семінар.pdf"; modification-date="Thu, 12 May 2016 01:48:56 +0300"; size=57814;'
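
The round trip above can be wrapped in a small helper so it does not depend on the header sitting at index 6. This is only a minimal sketch: the function names `recode_header` and `filename_from_content_disposition` are made up for illustration, it assumes the server sent UTF-8 bytes, and it only handles the simple quoted `filename="..."` form seen in this response. The example simulates the mojibake locally instead of making a network call.

```python
import re


def recode_header(value: str) -> str:
    """Undo a latin-1 decoding of header bytes that were really UTF-8."""
    try:
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string has characters outside latin-1 (already decoded
        # correctly) or the bytes are not valid UTF-8: leave it unchanged.
        return value


def filename_from_content_disposition(value: str):
    """Return the quoted filename parameter, or None if there is none."""
    match = re.search(r'filename="([^"]*)"', recode_header(value))
    return match.group(1) if match else None


# Simulate what the headers contain: UTF-8 bytes decoded as latin-1.
raw = 'attachment; filename="Про наук-досл семінар.pdf"; size=57814;'
mojibake = raw.encode('utf-8').decode('latin-1')

print(filename_from_content_disposition(mojibake))
```

With a real response, the header value could be fetched by name, for example with `response.getheader('Content-Disposition')`, and passed to the helper. The fallback in `recode_header` is a design choice: headers that are plain ASCII or already correct pass through untouched rather than raising.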