Search code examples
python-3.xpandasnumpygenfromtxt

Iterate through multiple CSV's checking for an integer value in each file


I am new to python and can use any help I can get. I am on a win7 machine and am using python 3.5 (anaconda)

I am trying to iterate through multiple CSV files (10k +) within a folder, checking for any value within that file that exceeds a predefined threshold.

I would like to build a dictionary, or list/tuple (basically whatever most resembles an sql table) using a sub-string of the filename as a unique identifier for a name field and have another column with total count of files that had values that exceeded the given threshold.

I don't expect any of you to do this for me as it is great practice, but I would appreciate any recommendations for modules that may make this a bit easier.

I've been able to check a file for a value but this is only about 10 minutes into this quest and I am not sure how I'd iterate through multiple files and build the table etc.. Thanks!

import numpy as np
path = 'C:\\path' 
file = 'file.csv'
with open(path+file) as f:
    my_data = np.genfromtxt(path+file, delimiter = ",")
    for data in my_data:
        if -1 in my_data:
            print("it sure is")   

Solution

  • Here is working Pandas solution:

    import glob
    import os
    import pandas as pd
    
    all_files = glob.glob(r'd:/temp/csv/*.csv')
    
    threshold = 100
    
    data = []
    
    for f in all_files:
        data.append([os.path.basename(f),
                    (pd.read_csv(f, header=None) > threshold).sum().sum()])
    
    df = pd.DataFrame(data, columns=['file','count'])
    
    print(df)
    
    # optionally save DataFrame to SQL table (`conn` - is a SQLAlchemy connection)
    #df.to_sql('table_name', conn)
    

    Output:

        file  count
    0  1.csv      2
    1  2.csv      3
    

    Test Data:

    1.csv:

    1,2,3,400
    10,111,45,67
    

    2.csv:

    1,200,300,4
    10,222,45,67
    

    UPDATE:

    you can parse the first number from the filename this way:

    In [87]: import re
    
    In [88]: f
    Out[88]: '/path/to/touchscreen_data_123456_1456789456_178.16.66.3'
    
    In [89]: re.sub(r'.*_\D+_(\d+)_\d+.*', r'\1', f)
    Out[89]: '123456'