I have a Python script using pandas to combine multiple ZIP files. I am using data on COVID-19 cases in Austria hosted in a GitHub repository here: https://github.com/statistikat/coronaDAT
I am trying to make it crawl a directory structure (all folders and subfolders) in the GitHub repo, identify the ZIP files, then extract specific CSV files from those ZIP files and combine the CSVs. In this case, I want to take all the CSV files titled "Bezirke.csv" and combine them into one.
I have a working version of the script that does this in the current working folder, but it does not crawl the directory structure or go into subfolders. See this question.
I am now trying to use os.walk(rootPath) to crawl the structure. It appears to be working, but stops with an error message:
Traceback (most recent call last):
  File "merge_zip_entire_directory.py", line 21, in <module>
    zip_file = ZipFile(filename)
  File "/Users/matt/opt/anaconda3/lib/python3.7/zipfile.py", line 1240, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '20200422_060000_orig_csv.zip'
I have verified that that particular zip file has a file named "Bezirke.csv". I don't understand why I'm getting the error message.
Here is the full script:
import fnmatch
import os
import pandas as pd
from zipfile import ZipFile
#set root path
rootPath = r"/Users/matt/OneDrive/Documents/04 Employment/Employers/State Department/COVID-19/test/"
#set file pattern
pattern = '*.zip'
#initialize variables
df_master = pd.DataFrame()
flag = False
#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern):
        #open the zip file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            if text_file.filename.endswith('Bezirke.csv'):
                df = pd.read_csv(zip_file.open(text_file.filename),
                                 delimiter=';',
                                 header=0,
                                 index_col=['Timestamp'],
                                 parse_dates=['Timestamp']
                                 )
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])
#sort index field Timestamp
df_master.sort_index(inplace=True)
#print master dataframe info
print(df_master.info())
#prepare data to export to csv
frame = df_master
#export to csv
try:
    frame.to_csv("combined_zip_Bezirke.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")
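You forgot to include the path. The filenames that os.walk returns in files are bare names, without the directory leading to them, so ZipFile(filename) looks for the file in the current working directory and fails as soon as the walk reaches a subfolder. What you need is: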
zip_file = ZipFile(os.path.join(root, filename))
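To see the difference between the bare name and the joined path, here is a minimal sketch that runs against a throwaway directory tree. The helper name find_zip_paths, the folder names and the sample CSV contents are made up for illustration and are not part of your script or the coronaDAT repository.

```python
import os
import tempfile
from zipfile import ZipFile


def find_zip_paths(root_path):
    """Return the full paths of all .zip files below root_path (hypothetical helper)."""
    zip_paths = []
    for root, dirs, files in os.walk(root_path):
        for filename in files:
            if filename.lower().endswith(".zip"):
                # files holds bare names only; join with root to get a usable path
                zip_paths.append(os.path.join(root, filename))
    return sorted(zip_paths)


# Build a temporary tree with one archive in a nested subfolder (made-up data).
demo_root = tempfile.mkdtemp()
sub_dir = os.path.join(demo_root, "2020", "04")
os.makedirs(sub_dir)
with ZipFile(os.path.join(sub_dir, "demo.zip"), "w") as zf:
    zf.writestr("Bezirke.csv", "Timestamp;Anzahl\n2020-04-22T06:00:00;1\n")

found = find_zip_paths(demo_root)
print(found)  # a single full path that ends in demo.zip

# Opening the joined path works no matter what the current working directory is.
with ZipFile(found[0]) as zf:
    print(zf.namelist())
```

Opening ZipFile("demo.zip") instead would only succeed if the script happened to be run from inside that subfolder, which is the situation your traceback shows.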
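Besides, the indentation inside your for loop is wrong. The if not flag / else block has to sit inside the if that checks for 'Bezirke.csv', so that it only runs when a matching file was actually read. It must be: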
for text_file in zip_file.infolist():
    if text_file.filename.endswith('Bezirke.csv'):
        df = pd.read_csv(zip_file.open(text_file.filename),
                         delimiter=';',
                         header=0,
                         index_col=['Timestamp'],
                         parse_dates=['Timestamp']
                         )
        if not flag:
            df_master = df
            flag = True
        else:
            df_master = pd.concat([df_master, df])
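If you want to avoid the flag variable altogether, one possible variation (not something your script requires) is to collect the DataFrames in a list and call pd.concat once at the end. The sketch below assumes the same semicolon-delimited layout with a Timestamp column; the function name combine_member_csvs, the folder names and the sample rows are made up for illustration.

```python
import os
import tempfile
from zipfile import ZipFile

import pandas as pd


def combine_member_csvs(root_path, member_suffix="Bezirke.csv"):
    """Concatenate every CSV ending in member_suffix from all ZIPs below root_path."""
    frames = []
    for root, dirs, files in os.walk(root_path):
        for filename in files:
            if not filename.lower().endswith(".zip"):
                continue
            # The with-block closes each archive once it has been read.
            with ZipFile(os.path.join(root, filename)) as zip_file:
                for member in zip_file.namelist():
                    if member.endswith(member_suffix):
                        with zip_file.open(member) as handle:
                            frames.append(
                                pd.read_csv(
                                    handle,
                                    delimiter=";",
                                    header=0,
                                    index_col="Timestamp",
                                    parse_dates=["Timestamp"],
                                )
                            )
    if not frames:
        # Nothing matched: return an empty frame instead of failing in concat.
        return pd.DataFrame()
    return pd.concat(frames).sort_index()


# Two made-up archives in separate subfolders, with timestamps out of order.
demo_root = tempfile.mkdtemp()
samples = {
    "a": "Timestamp;Bezirk;Anzahl\n2020-04-23T06:00:00;Demo;2\n",
    "b": "Timestamp;Bezirk;Anzahl\n2020-04-22T06:00:00;Demo;1\n",
}
for folder, csv_text in samples.items():
    os.makedirs(os.path.join(demo_root, folder))
    with ZipFile(os.path.join(demo_root, folder, "data.zip"), "w") as zf:
        zf.writestr("Bezirke.csv", csv_text)

combined = combine_member_csvs(demo_root)
print(combined)
```

Concatenating once also avoids copying the growing df_master on every iteration, which can matter when there are many archives.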