Search code examples
pythonpandasfor-loopiterationglob

Iterating through files in a folder


I am using pandas to read and analyse large NHS prescription datasets. They are individually saved as csv files in a folder. I want to apply the same couple of lines of code to each of these files individually - so I want to individually read each dataframe with pandas and then extract relevant data and perform a basic calculation.

This is my code so far:

import pandas as pd
import glob

path = "/Volumes/TOSHIBA EXT/Datasets/2015"
all_files = glob.glob(path + "/*.CSV")

for f in all_files:
    pd.read_csv(f,index_col=None, header=0, usecols=[2,4,5], names=['PRACTICE','BNF NAME', 'ITEMS'])
    f=f[f['BNF NAME'].str.contains('Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav')]
    print pd.to_numeric(f['ITEMS']).sum()

However the following error is coming up..

TypeError: string indices must be integers, not str 

I would ideally like to define a function that selects the relevant rows containing the strings: "Ampicillin", "Amoxicillin" and "Co-Amoxiclav"; and then totals the number of items for each of these (i.e totalling the column named 'ITEMS') which I could then use in a for loop to iterate through each file.

Would appreciate any pointers as to how to avoid this error and achieve the above.

Many thanks! :)


Solution

  • f is filename but you use it as DataFrame - f['BNF NAME'].

    You need

    df = pd.read_csv(...)
    

    and then you can use

    df['BNF NAME']
    
    df = df[ df['BNF NAME'] ... ]
    
    df['ITEMS']