python python-requests pycharm zip

Read two files inside a .zip URL: delete the PDF, keep the CSV


I want to download, save, and clean a set of datasets that are stored as .zip files at more than 150 URLs. My function follows the package documentation:

import requests

def download_url(url, save_path, chunk_size = 128):

    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

But this is not working: 'requests.get' doesn't return what I need. I think this might be because the .zip contains two distinct files, a .csv and a .pdf. Is there a way to read both files, delete the .pdf, and save only the .csv?


Solution

  • The code below did it for me:

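    from urllib.request import urlopen
    from io import BytesIO
    import zipfile37  # backport of the standard library's zipfile; plain `import zipfile` should also work
    import pandas as pd

    # Maps each extracted member name to its DataFrame
    dfs = {}

    # Download the whole archive into memory
    req = urlopen('https://cdn.tse.jus.br/estatistica/sead/odsele/votacao_secao/votacao_secao_2014_RJ.zip')
    data = req.read()

    # Open the archive from the in-memory bytes and load only the .txt members;
    # anything else in the archive (such as the .pdf) is simply skipped
    zip_file = zipfile37.ZipFile(BytesIO(data))
    for name in zip_file.namelist():
        if name.lower().endswith('.txt'):
            dfs[name] = pd.read_csv(zip_file.open(name), sep=";", header=None, encoding='latin1')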
    

    It's a tweak of the code from "How to open a csv in a zip in a zip with python?"

    Thanks!
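
The answer loads the tables into pandas but doesn't write anything to disk, which the question also asked for. Below is a minimal sketch of the "keep the CSV, drop the PDF" step, assuming each archive fits in memory. The names save_matching_members and download_and_extract, the dest_dir parameter, and the default suffixes are hypothetical and not from the original post. It uses only the standard-library zipfile and urllib.request modules.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def save_matching_members(zip_bytes, dest_dir, suffixes=(".csv", ".txt")):
    """Write only the archive members whose names end in one of `suffixes`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # str.endswith accepts a tuple, so one check covers all suffixes
            if name.lower().endswith(suffixes):
                # Keep only the base name, flattening any folders in the archive
                target = dest / Path(name).name
                target.write_bytes(zf.read(name))
                saved.append(target)
    return saved


def download_and_extract(url, dest_dir, suffixes=(".csv", ".txt")):
    """Download one .zip into memory and keep only the matching members."""
    with urlopen(url) as response:
        return save_matching_members(response.read(), dest_dir, suffixes)
```

Calling download_and_extract(url, 'data') in a loop over the URL list should handle each archive in turn. Since the PDF is never written to disk, there is nothing to delete afterwards.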