I want to download, save, and clean a set of datasets that are stored as .zip files at more than 150 URLs. My function follows the package documentation like this:
import requests
def download_url(url, save_path, chunk_size=128):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)
But this is not working. The issue: requests.get doesn't return what I need. I think that might be because there are two distinct files in the .zip: a .csv and a .pdf. Is there a way to read both files, delete the .pdf, and save only the .csv?
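For reference, requests.get does in fact return the raw bytes of the archive (in r.content); the standard-library zipfile module can then open those bytes in memory and list the members, so the .pdf can simply be skipped. A minimal sketch, assuming the URL points at one of the archives in question:
import requests
import zipfile
from io import BytesIO

url = 'https://cdn.tse.jus.br/estatistica/sead/odsele/votacao_secao/votacao_secao_2014_RJ.zip'
r = requests.get(url)
r.raise_for_status()

# r.content holds the raw bytes of the zip; zipfile can read them without
# ever writing the archive to disk.
with zipfile.ZipFile(BytesIO(r.content)) as zf:
    print(zf.namelist())  # shows every member, e.g. the data file and the .pdf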
The code below did it for me:
from urllib.request import urlopen
from io import BytesIO
import zipfile
import pandas as pd

dfs = {}
req = urlopen('https://cdn.tse.jus.br/estatistica/sead/odsele/votacao_secao/votacao_secao_2014_RJ.zip')
data = req.read()

# Open the downloaded archive in memory and keep only the data files
# (the TSE archives store the data as semicolon-separated .txt files,
# so the .pdf is never read).
zip_file = zipfile.ZipFile(BytesIO(data))
for name in zip_file.namelist():
    if name.lower().endswith('.txt'):
        dfs[name] = pd.read_csv(zip_file.open(name), sep=";", header=None, encoding='latin1')
It's a tweak of How to open a csv in a zip in a zip with python?
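Since the original goal was 150+ URLs, the same approach extends to a loop that saves only the data files to disk. A sketch under the assumption that every archive follows the same layout; the urls list and the out_dir directory are placeholders you would fill in:
import os
from urllib.request import urlopen
from io import BytesIO
import zipfile
import pandas as pd

urls = [
    'https://cdn.tse.jus.br/estatistica/sead/odsele/votacao_secao/votacao_secao_2014_RJ.zip',
    # ... the other 150+ URLs
]
out_dir = 'csv'  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

for url in urls:
    data = urlopen(url).read()
    with zipfile.ZipFile(BytesIO(data)) as zf:
        for name in zf.namelist():
            # Skip the .pdf (and any other member); keep only the data files.
            if not name.lower().endswith('.txt'):
                continue
            df = pd.read_csv(zf.open(name), sep=";", header=None, encoding='latin1')
            # Save each table as a .csv, reusing the member's base name.
            base = os.path.splitext(os.path.basename(name))[0]
            df.to_csv(os.path.join(out_dir, base + '.csv'), index=False)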
Thanks!