urllib.urlretrieve and urllib2 corrupting files


I've run into a frustrating stumbling block with an XBMC extension I'm working on.

In summary, if I download a file using Firefox, IE, etc., the file is valid and works fine, but if I use urllib or urllib2 in Python, the file is corrupted.

The file in question is: http://re.zoink.it/00b007c479 (007960DAD4832AC714C465E207055F2BE18CAFF6.torrent)

Here are the checksums:

PY: 2d1528151c62526742ce470a01362ab8ea71e0a7
IE: 60a93c309cae84a984bc42820e6741e4f702dc21

The checksums do not match: the Python download is corrupt, while the IE/Firefox download is not.

Here's the function that I've written to do this task:

import os
import re
import urllib2

def DownloadFile(uri, localpath):
  '''Downloads a file from the specified URI to the local system.

  Keyword arguments:
  uri -- the remote URI of the resource to download
  localpath -- the local directory to save the downloaded resource in
  '''
  remotefile = urllib2.urlopen(uri)
  # Get the filename from the content-disposition header
  cdHeader = remotefile.info()['content-disposition']

  # typical header looks like: 'attachment;   filename="Boardwalk.Empire.S05E00.The.Final.Shot.720p.HDTV.x264-BATV.[eztv].torrent"'
  # use RegEx to slice out the part we want (filename)
  filename = re.findall(r'filename="(.*?)"', cdHeader)[0]
  filepath = os.path.join(localpath, filename)
  if os.path.exists(filepath):
    return

  data = remotefile.read()
  with open(filepath, "wb") as code:
    code.write(data) # this is resulting in a corrupted file

  # this is resulting in a corrupted file as well
  #urllib.urlretrieve(uri, filepath)

What am I doing wrong? It's hit or miss: some sources download correctly, while others always produce a corrupted file when I download them with Python. They all seem to download correctly if I use a web browser.

Thanks in advance...


Solution

  • The response is gzip-encoded:

    >>> import urllib2
    >>> remotefile = urllib2.urlopen('http://re.zoink.it/00b007c479')
    >>> remotefile.info()['content-encoding']
    'gzip'
    

    Your browser decompresses this for you, but urllib2 does not. You need to decompress the data yourself before writing it to disk:

    import zlib
    
    data = remotefile.read()
    if remotefile.info().get('content-encoding') == 'gzip':
        # MAX_WBITS + 16 tells zlib to expect a gzip header and trailer
        data = zlib.decompress(data, zlib.MAX_WBITS + 16)
    

    Once decompressed, the data matches the SHA-1 checksum of your browser download:

    >>> import zlib
    >>> import hashlib
    >>> data = remotefile.read()
    >>> hashlib.sha1(data).hexdigest()
    '2d1528151c62526742ce470a01362ab8ea71e0a7'
    >>> hashlib.sha1(zlib.decompress(data, zlib.MAX_WBITS + 16)).hexdigest()
    '60a93c309cae84a984bc42820e6741e4f702dc21'
    

    You may want to switch to the requests library, which handles content encoding transparently:

    >>> import requests
    >>> response = requests.get('http://re.zoink.it/00b007c479')
    >>> hashlib.sha1(response.content).hexdigest()
    '60a93c309cae84a984bc42820e6741e4f702dc21'
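
If adding a dependency is not an option, the manual gzip check shown earlier can be wrapped in a small helper. This is only a sketch: the name decode_body and its two parameters are hypothetical, it handles gzip alone, and any other Content-Encoding value is passed through untouched.

```python
import zlib


def decode_body(data, content_encoding):
    """Return the raw payload, undoing a gzip Content-Encoding if present.

    decode_body is a hypothetical helper name, not part of urllib2.
    data is the byte string read from the response; content_encoding is
    the value of the Content-Encoding header, or None if it is absent.
    """
    encoding = (content_encoding or '').strip().lower()
    if encoding == 'gzip':
        # MAX_WBITS + 16 tells zlib to expect a gzip header and trailer.
        return zlib.decompress(data, zlib.MAX_WBITS + 16)
    # Assumption: anything other than gzip is written out unchanged.
    return data
```

In the question's DownloadFile function, the call would presumably replace the plain read, along the lines of data = decode_body(remotefile.read(), remotefile.info().get('content-encoding')), before the data is written to disk.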