Tags: python, pandas, unzip, zip

Create a DataFrame from a CSV inside a zip file


I am trying to read the WGIData.csv file into a pandas DataFrame. WGIData.csv is inside a zip file that I am downloading from this URL:

http://databank.worldbank.org/data/download/WGI_csv.zip

But when I try to read it, it throws the error BadZipFile: File is not a zip file

Here is my Python code:

import pandas as pd
from urllib.request import urlopen
from zipfile import ZipFile

class Get_Data():

    def Return_csv_from_zip(self, url):
        self.zip = urlopen(url)
        self.myzip = ZipFile(self.zip)
        self.myzip = self.zip.extractall(self.myzip)
        self.file = pd.read_csv(self.myzip)
        self.zip.close()

        return self.file

url = 'http://databank.worldbank.org/data/download/WGI_csv.zip'
data = Get_Data()
df = data.Return_csv_from_zip(url)

Solution

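  • urlopen() returns an HTTPResponse, which is not seekable, and ZipFile() needs a seekable file-like object because it locates the archive's directory by seeking to the end of the file. That is why it raises BadZipFile. You can read() the whole response and wrap the bytes in io.BytesIO() to get a seekable object: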

    In []:
    from io import BytesIO
    
    z = urlopen('http://databank.worldbank.org/data/download/WGI_csv.zip')
    myzip = ZipFile(BytesIO(z.read())).extract('WGIData.csv')
    pd.read_csv(myzip)
    
    Out[]:
         Country Name Country Code                                     Indicator Name    Indicator Code       1996  \
    0        Anguilla          AIA                    Control of Corruption: Estimate            CC.EST        NaN   
    1        Anguilla          AIA           Control of Corruption: Number of Sources         CC.NO.SRC        NaN   
    2        Anguilla          AIA             Control of Corruption: Percentile Rank        CC.PER.RNK        NaN   
    3        Anguilla          AIA  Control of Corruption: Percentile Rank, Lower ...  CC.PER.RNK.LOWER        NaN   
    4        Anguilla          AIA  Control of Corruption: Percentile Rank, Upper ...  CC.PER.RNK.UPPER        NaN   
    5        Anguilla          AIA              Control of Corruption: Standard Error        CC.STD.ERR        NaN   
    ...
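
Note that extract() in the snippet above writes WGIData.csv to the current working directory and returns its path, which is what pd.read_csv() then opens. If you would rather not leave a file on disk, the member can be read straight from the in-memory archive. The following is only a sketch of that variant: fetch_bytes and open_zip_member are hypothetical helper names, not part of pandas or zipfile, and it assumes the archive is small enough to hold in memory.

```python
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile


def fetch_bytes(url):
    """Download the whole response body into memory (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def open_zip_member(payload, member):
    """Return a seekable in-memory buffer holding one archive member.

    payload is the raw bytes of the zip archive; member is the name of
    the file inside it. Nothing is written to disk.
    """
    with ZipFile(BytesIO(payload)) as archive:
        return BytesIO(archive.read(member))


# Usage with the URL and file name from the question (needs network access):
# import pandas as pd
# payload = fetch_bytes('http://databank.worldbank.org/data/download/WGI_csv.zip')
# df = pd.read_csv(open_zip_member(payload, 'WGIData.csv'))
```

pd.read_csv() accepts a file-like object as well as a path, so the returned buffer can be passed to it directly. Calling archive.namelist() first is a quick way to check the exact member name if you are unsure of it.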