I am using pandas to read and analyse large NHS prescription datasets. They are individually saved as csv files in a folder. I want to apply the same couple of lines of code to each of these files individually - so I want to individually read each dataframe with pandas and then extract relevant data and perform a basic calculation.
This is my code so far:
import pandas as pd
import glob
path = "/Volumes/TOSHIBA EXT/Datasets/2015"
all_files = glob.glob(path + "/*.CSV")
for f in all_files:
pd.read_csv(f,index_col=None, header=0, usecols=[2,4,5], names=['PRACTICE','BNF NAME', 'ITEMS'])
f=f[f['BNF NAME'].str.contains('Ampicillin' and 'Amoxicillin' and 'Co-Amoxiclav')]
print pd.to_numeric(f['ITEMS']).sum()
However the following error is coming up..
TypeError: string indices must be integers, not str
I would ideally like to define a function that selects the relevant rows containing the strings: "Ampicillin", "Amoxicillin" and "Co-Amoxiclav"; and then totals the number of items for each of these (i.e totalling the column named 'ITEMS') which I could then use in a for loop to iterate through each file.
Would appreciate any pointers as to how to avoid this error and achieve the above.
Many thanks! :)
f
is filename but you use it as DataFrame - f['BNF NAME']
.
You need
df = pd.read_csv(...)
and then you can use
df['BNF NAME']
df = df[ df['BNF NAME'] ... ]
df['ITEMS']