I want to load a CSV file from a zip archive hosted at a URL into a pandas DataFrame. I referred to the solution here and used it as follows:
from urllib import request
import zipfile
import pandas as pd
# link to the zip file
link = 'https://cricsheet.org/downloads/'
# the zip file is named as ipl_csv2.zip
request.urlretrieve(link, 'ipl_csv2.zip')
compressed_file = zipfile.ZipFile('ipl_csv2.zip')
# I need the csv file named all_matches.csv from ipl_csv2.zip
csv_file = compressed_file.open('all_matches.csv')
data = pd.read_csv(csv_file)
data.head()
But after running the code, I'm getting an error as:
BadZipFile Traceback (most recent call last)
<ipython-input-3-7b7a01259813> in <module>
1 link = 'https://cricsheet.org/downloads/'
2 request.urlretrieve(link, 'ipl_csv2.zip')
----> 3 compressed_file = zipfile.ZipFile('ipl_csv2.zip')
4 csv_file = compressed_file.open('all_matches.csv')
5 data = pd.read_csv(csv_file)
~\Anaconda3\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
1267 try:
1268 if mode == 'r':
-> 1269 self._RealGetContents()
1270 elif mode in ('w', 'x'):
1271 # set the modified flag so central directory gets written
~\Anaconda3\lib\zipfile.py in _RealGetContents(self)
1334 raise BadZipFile("File is not a zip file")
1335 if not endrec:
-> 1336 raise BadZipFile("File is not a zip file")
1337 if self.debug > 1:
1338 print(endrec)
BadZipFile: File is not a zip file
I don't have much experience handling zip files in Python, so please help me understand what I need to correct in my code.
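One way to narrow down a `BadZipFile` error is to inspect what was actually downloaded. The sketch below is only a diagnostic aid: `describe_download` is a hypothetical helper, and the in-memory archive and HTML snippet are made-up stand-ins for real downloads. A plausible cause here, though not confirmed, is that `link` points at the downloads page rather than at `ipl_csv2.zip` itself, so an HTML page was saved under the zip file's name.

```python
import io
import zipfile


def describe_download(payload: bytes) -> str:
    """Return a short description of what a downloaded payload looks like."""
    if zipfile.is_zipfile(io.BytesIO(payload)):
        return "zip archive"
    # Show the first bytes so that an HTML page is easy to spot.
    return f"not a zip archive, starts with {payload[:15]!r}"


# Build a tiny archive in memory to stand in for a real zip download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("all_matches.csv", "match_id,season\n1,2008\n")

print(describe_download(buffer.getvalue()))
# Made-up HTML payload standing in for a saved web page.
print(describe_download(b"<!DOCTYPE html><html>...</html>"))
```

Running the same check on the bytes of the saved `ipl_csv2.zip` (for example, `open('ipl_csv2.zip', 'rb').read()`) should show whether it is a real archive.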
If I open the URL https://cricsheet.org/downloads/ipl_csv2.zip in a web browser, the zip file is downloaded to my system automatically. Because data is added to this zip file daily, I want to access the URL and get the CSV file directly via Python, without keeping the archive on disk.
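If the goal is to avoid writing the archive to disk at all, one possible approach is to keep the downloaded bytes in memory and open the CSV member from there. This is a minimal sketch under assumptions: `open_zip_member` and `load_csv_from_zip_url` are hypothetical helper names, the whole archive is assumed to fit in memory, and the server is assumed to allow a plain `urlopen` request.

```python
import io
import zipfile
from urllib.request import urlopen


def open_zip_member(zip_bytes: bytes, member: str) -> io.BytesIO:
    """Return one archive member as an in-memory binary file object."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return io.BytesIO(zf.read(member))


def load_csv_from_zip_url(url: str, member: str):
    """Download a zip archive and load one CSV member into a DataFrame."""
    import pandas as pd  # imported here so the helper above stays stdlib-only

    with urlopen(url) as response:
        zip_bytes = response.read()  # the whole archive is held in memory
    return pd.read_csv(open_zip_member(zip_bytes, member))
```

With these helpers, `data = load_csv_from_zip_url('https://cricsheet.org/downloads/ipl_csv2.zip', 'all_matches.csv')` would fetch the current archive on each run without leaving files behind, at the cost of holding the archive and the extracted CSV in memory.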
Edit 1: If you have any other code solution, please do share it.
This is what I did after discussion with @nobleknight below:
# importing libraries
import os
import shutil
import zipfile
from urllib.request import urlopen
import pandas as pd

url = 'https://cricsheet.org/downloads/ipl_csv2.zip'
file_name = 'ipl_csv2.zip'

# downloading the zip file from the URL and saving it locally
with urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# extracting the required file from the zip file
with zipfile.ZipFile(file_name) as zf:
    zf.extract('all_matches.csv')

# deleting the zip file from the directory
os.remove(file_name)

# loading data from the extracted file
data = pd.read_csv('all_matches.csv')
This solution avoids the ContentTooShortError and HTTPForbiddenError errors that I kept running into with every other solution I found on the net. Thanks to @nobleknight for providing part of the solution with reference to this.
Any other thoughts are welcome.