
Zipped File from URL to Python (Pandas)


I want to load a zipped file from GitLab into my Jupyter Notebook with the following code:


link='https://git. ... master/data.zip'

import urllib.request
urllib.request.urlretrieve(link, "data.zip")
import zipfile
compressed_file = zipfile.ZipFile('data.zip')
csv_file = compressed_file.open('data.csv')
df = pd.read_csv(csv_file)

Downloading the file by hand is not an option; I need to get the data from the URL.

I get the following error on line 4 (----> 4 compressed_file = zipfile.ZipFile('data.zip')):

BadZipFile: File is not a zip file

What is the error in my code?


Solution

  • Your code sample is not reproducible. The code below shows how to download a zip file from a URL and read a file inside it. The example data is GeoJSON, so json.load() is used; for CSV data, use pd.read_csv() instead. This is effectively a three-step process:

    • pass a URL to requests.get() and write the response in chunks to a local file
    • inspect the contents of the zip file with zfile.infolist() to find the file you want to use
    • open a file handle on that file and use it; in your case, pass it to pd.read_csv()

    This is all standard requests and file handling, independent of how the data is used afterwards.

    import requests
    import pandas as pd
    from pathlib import Path
    from zipfile import ZipFile
    import json, io
    
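    # source GeoJSON for country boundaries: look up the zip resource in the data package
    geosrc = pd.json_normalize(requests.get("https://pkgstore.datahub.io/core/geo-countries/7/datapackage.json").json()["resources"])
    zip_url = geosrc.loc[geosrc["name"].eq("geo-countries_zip"), "path"].values[0]
    fn = Path(zip_url).name
    
    # step 1: download the zip in chunks to a local file (skipped if it already exists)
    if not Path.cwd().joinpath(fn).exists():
        r = requests.get(zip_url, stream=True)
        r.raise_for_status()  # fail early on an HTTP error instead of saving an error page
        with open(fn, "wb") as fd:
            for chunk in r.iter_content(chunk_size=128):
                fd.write(chunk)
    
    # steps 2 and 3: take the first file in the archive, open a handle and parse it
    with ZipFile(fn) as zfile:
        with zfile.open(zfile.infolist()[0]) as f:
            geojson = json.load(f)  # for CSV data: pd.read_csv(f)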
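
For the CSV case in the question, the same three steps could be sketched roughly as follows. This is a minimal sketch, not a definitive fix: it uses urllib (as in the question) rather than requests, keeps the archive in memory instead of writing it to disk, and the helper names fetch_bytes and read_csv_from_zip are made up for illustration. It also checks that the download really is a zip archive, on the assumption that the BadZipFile error came from the URL returning something else, such as an HTML page.

```python
import io
import urllib.request
import zipfile
from typing import Optional

import pandas as pd


def fetch_bytes(url: str) -> bytes:
    """Download the raw response body from a URL into memory."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_csv_from_zip(payload: bytes, member: Optional[str] = None) -> pd.DataFrame:
    """Read one CSV member of an in-memory zip archive into a DataFrame.

    Hypothetical helper: `member` is the file name inside the archive;
    if omitted, the first entry is used.
    """
    buffer = io.BytesIO(payload)
    # A response that is not a zip archive (for example an HTML page)
    # is rejected here with a readable message instead of BadZipFile.
    if not zipfile.is_zipfile(buffer):
        raise ValueError(f"not a zip archive; data starts with {payload[:40]!r}")
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as handle:
            return pd.read_csv(handle)


# Hypothetical usage; `link` is the URL variable from the question:
# df = read_csv_from_zip(fetch_bytes(link), "data.csv")
```

If the ValueError is raised, printing the first bytes of the response usually shows what the server actually returned, which helps decide whether the link or the access rights need to change.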