Tags: python, python-2.7, urllib2

Downloading a file from the internet with python


I'm trying to retrieve CSV data from a website through this link.

When downloaded manually you get synop.201708.csv.gz, which is in fact a CSV wrongly named .gz; it weighs 2233 KB.

When running this code:

import urllib

file_date = '201708'
file_url = "https://donneespubliques.meteofrance.fr/donnees_libres/Txt/Synop/Archive/synop.{}.csv.gz".format(file_date)
output_file_name = "{}.csv.gz".format(file_date)

print "downloading {} to {}".format(file_url, output_file_name)
urllib.urlretrieve(file_url, output_file_name)

I'm getting a corrupted ~361 KB file.

Any ideas why?


Solution

  • What seems to be happening is that the MétéoFrance site is misusing the Content-Encoding header. The website reports that it is serving you a gzip file (Content-Type: application/x-gzip) and that it is also encoding it in gzip format for the transfer (Content-Encoding: x-gzip). It further says the page is an attachment that should be saved under its original name (Content-Disposition: attachment).

    In a vacuum, this would make sense (to a degree; compressing an already compressed file is mostly useless): the server serves a gzip file and compresses it again for transport, and upon receipt your browser undoes the transport compression and saves the original gzip file. Here, though, the data was only compressed once. Your browser decompresses the stream as the header instructs and ends up saving the plain CSV under the misleading .gz name, while urllib.urlretrieve, which does not honor Content-Encoding, saves the raw bytes as sent. So your ~361 KB file is likely not corrupted at all: it is the genuinely gzip-compressed data.
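
    If that's the case, decompressing the downloaded file yourself should recover the CSV. A minimal sketch (the file names are the ones from the question; the helper name is my own):

    ```python
    import gzip
    import shutil

    def decompress_gzip(src_path, dst_path):
        """Decompress a gzip file at src_path into a plain file at dst_path."""
        with gzip.open(src_path, "rb") as compressed:
            with open(dst_path, "wb") as plain:
                shutil.copyfileobj(compressed, plain)

    # e.g. turn the file saved by urlretrieve into the plain CSV:
    # decompress_gzip("201708.csv.gz", "synop.201708.csv")
    ```

    If gzip raises an error on the downloaded file, then the data really is damaged rather than merely compressed, and the header theory above doesn't hold.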