I am new to python and can use any help I can get. I am on a win7 machine and am using python 3.5 (anaconda)
I am trying to iterate through multiple CSV files (10k +) within a folder, checking for any value within that file that exceeds a predefined threshold.
I would like to build a dictionary, or list/tuple (basically whatever most resembles an sql table) using a sub-string of the filename as a unique identifier for a name field and have another column with total count of files that had values that exceeded the given threshold.
I don't expect any of you to do this for me as it is great practice, but I would appreciate any recommendations for modules that may make this a bit easier.
I've been able to check a file for a value but this is only about 10 minutes into this quest and I am not sure how I'd iterate through multiple files and build the table etc.. Thanks!
import numpy as np
path = 'C:\\path'
file = 'file.csv'
with open(path+file) as f:
my_data = np.genfromtxt(path+file, delimiter = ",")
for data in my_data:
if -1 in my_data:
print("it sure is")
Here is working Pandas solution:
import glob
import os
import pandas as pd
all_files = glob.glob(r'd:/temp/csv/*.csv')
threshold = 100
data = []
for f in all_files:
data.append([os.path.basename(f),
(pd.read_csv(f, header=None) > threshold).sum().sum()])
df = pd.DataFrame(data, columns=['file','count'])
print(df)
# optionally save DataFrame to SQL table (`conn` - is a SQLAlchemy connection)
#df.to_sql('table_name', conn)
Output:
file count
0 1.csv 2
1 2.csv 3
Test Data:
1.csv:
1,2,3,400
10,111,45,67
2.csv:
1,200,300,4
10,222,45,67
UPDATE:
you can parse the first number from the filename this way:
In [87]: import re
In [88]: f
Out[88]: '/path/to/touchscreen_data_123456_1456789456_178.16.66.3'
In [89]: re.sub(r'.*_\D+_(\d+)_\d+.*', r'\1', f)
Out[89]: '123456'