I'm trying to retrieve CSV data from a website through this link.
When downloaded manually you get synop.201708.csv.gz,
which is in fact a CSV wrongly named .gz; it weighs 2,233 KB.
When running this code :
import urllib
file_date = '201708'
file_url = "https://donneespubliques.meteofrance.fr/donnees_libres/Txt/Synop/Archive/synop.{}.csv.gz".format(file_date)
output_file_name = "{}.csv.gz".format(file_date)
print "downloading {} to {}".format(file_url, output_file_name)
urllib.urlretrieve(file_url, output_file_name)
Instead, I'm getting a corrupted file of ~361 KB.
Any ideas why?
What seems to be happening is that the MétéoFrance site is misusing the Content-Encoding header. The server reports that it is serving a gzip file (Content-Type: application/x-gzip) and that it has encoded it in gzip format for the transfer (Content-Encoding: x-gzip). It also says the response is an attachment that should be saved under its normal name (Content-Disposition: attachment).
In a vacuum, this would make sense (to a degree; compressing an already-compressed file is mostly useless): the server serves a gzip file and compresses it again for transport, and upon receipt your client undoes the transport compression and saves the original gzip file. Here, though, the body was only compressed once, so when the client decompresses the stream it doesn't get back what it expects.
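One way to cope with a server like this is to download the raw bytes yourself and then check for the gzip magic bytes (1f 8b) before deciding whether to decompress. The sketch below is a minimal Python 3 version (urllib.request instead of the Python 2 urllib in the question); the Accept-Encoding: identity header and the maybe_gunzip helper are my additions, not anything the MétéoFrance API documents.

```python
import gzip
import urllib.request

def maybe_gunzip(data):
    """Undo gzip compression if the payload starts with the gzip
    magic bytes (0x1f 0x8b); otherwise return the data unchanged."""
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data

def download(url, out_path):
    # Ask the server not to apply any transfer encoding; a misbehaving
    # server may still label or compress the body inconsistently, so we
    # also inspect the magic bytes instead of trusting the headers.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "identity"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(maybe_gunzip(data))
```

Calling download(file_url, "synop.201708.csv") should then leave you with the plain CSV regardless of whether the server compressed the body for transport, since the decision is made on the bytes themselves rather than on the (unreliable) Content-Encoding header.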