python pandas dataframe csv zip

How to Load a CSV File from a Zipped Folder at a URL into a Pandas DataFrame


I want to load a CSV file from a zipped folder at a URL into a Pandas DataFrame. I adapted an existing solution as follows:

from urllib import request
import zipfile
import pandas as pd

# link to the zip file
link = 'https://cricsheet.org/downloads/'
# the zip file is named as ipl_csv2.zip
request.urlretrieve(link, 'ipl_csv2.zip')
compressed_file = zipfile.ZipFile('ipl_csv2.zip')

# I need the csv file named all_matches.csv from ipl_csv2.zip
csv_file = compressed_file.open('all_matches.csv')
data = pd.read_csv(csv_file)
data.head()

But after running the code, I get this error:

BadZipFile                                Traceback (most recent call last)
<ipython-input-3-7b7a01259813> in <module>
      1 link = 'https://cricsheet.org/downloads/'
      2 request.urlretrieve(link, 'ipl_csv2.zip')
----> 3 compressed_file = zipfile.ZipFile('ipl_csv2.zip')
      4 csv_file = compressed_file.open('all_matches.csv')
      5 data = pd.read_csv(csv_file)

~\Anaconda3\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1267         try:
   1268             if mode == 'r':
-> 1269                 self._RealGetContents()
   1270             elif mode in ('w', 'x'):
   1271                 # set the modified flag so central directory gets written

~\Anaconda3\lib\zipfile.py in _RealGetContents(self)
   1334             raise BadZipFile("File is not a zip file")
   1335         if not endrec:
-> 1336             raise BadZipFile("File is not a zip file")
   1337         if self.debug > 1:
   1338             print(endrec)

BadZipFile: File is not a zip file

I don't have much experience with zip file handling in Python. What do I need to change in my code?

If I open the URL https://cricsheet.org/downloads/ipl_csv2.zip in a web browser, the zip file is downloaded automatically. Since data is added to this zip file daily, I want to fetch the URL and read the CSV file directly in Python to save storage.
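
One way to check what urlretrieve actually saved is to inspect the bytes: a real archive passes zipfile.is_zipfile, while an HTML page (which a directory-style URL such as the link in the code above may well return) does not. A minimal sketch; describe_payload is a hypothetical helper, not part of any library:

```python
import io
import zipfile


def describe_payload(payload: bytes) -> str:
    """Classify downloaded bytes as 'zip', 'html', or 'unknown' (hypothetical helper)."""
    # is_zipfile accepts a file-like object and looks for the zip
    # end-of-central-directory record.
    if zipfile.is_zipfile(io.BytesIO(payload)):
        return "zip"
    # A directory or listing page usually starts with an HTML doctype or tag.
    head = payload[:200].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


# Usage with the file saved by urlretrieve in the question:
# with open("ipl_csv2.zip", "rb") as f:
#     print(describe_payload(f.read()))
```

If this reports something other than "zip", the problem is likely the URL that was requested rather than the zipfile handling.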

Edit 1: If you have any other solution, please share it.


Solution

  • This is what I did after discussing it with @nobleknight below:

    # importing libraries
    import os
    import shutil
    import zipfile
    from urllib.request import urlopen

    import pandas as pd

    url = 'https://cricsheet.org/downloads/ipl_csv2.zip'
    file_name = 'ipl_csv2.zip'

    # downloading the zip file from the URL
    with urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

    # extracting the required file from the zip file
    # (done after the download block so the archive is fully written and closed)
    with zipfile.ZipFile(file_name) as zf:
        zf.extract('all_matches.csv')

    # deleting the zip file from the directory
    os.remove(file_name)

    # loading data from the extracted file
    data = pd.read_csv('all_matches.csv')
    

    This solution avoids the ContentTooShortError and HTTP Forbidden (403) errors that I kept running into with every other solution I found online. Thanks to @nobleknight for providing part of the solution.

    Any other thoughts are welcome.
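
A possible variation, since the goal was to save storage: keep the archive in memory and never write it to disk. This is a minimal sketch, not a tested replacement for the code above. open_zip_member and load_csv_from_zip_url are hypothetical helper names, the whole archive has to fit in RAM, and the request may still be rejected if the server refuses the default urllib client.

```python
import io
import zipfile
from urllib.request import urlopen


def open_zip_member(payload: bytes, member: str) -> io.BytesIO:
    """Return one member of an in-memory zip archive as a seekable byte stream."""
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        return io.BytesIO(zf.read(member))


def load_csv_from_zip_url(url: str, member: str):
    """Download a zip archive into memory and parse one CSV member with pandas."""
    import pandas as pd  # imported here so the helper above works without pandas

    with urlopen(url) as response:
        payload = response.read()
    return pd.read_csv(open_zip_member(payload, member))


# Usage (needs network access; URL and member name are the ones from the question):
# data = load_csv_from_zip_url('https://cricsheet.org/downloads/ipl_csv2.zip',
#                              'all_matches.csv')
# data.head()
```

Because nothing is saved locally, there is no zip file to delete afterwards and no extracted CSV left in the working directory.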