I have 5 csv files to read but the column names and operations to do using that csv files remain the same.
For example, in my code I do the 3 steps below (but repeat it for all 5 files)
a) I read csv and store them in a dataframe
b) convert the dataframe contents to upper case
c) drop duplicates using subset = sourceName
column
d) create a series object
test_map = pd.read_csv('test_map.csv')
test_map = test_map .apply(lambda x: x.astype(str).str.upper())
test_map = test_map .drop_duplicates(subset=['sourceName'])
test_Id_map = test_map.set_index(['sourceName'])['Id']
val_op_map = pd.read_csv('op_val_map.csv')
val_op_map = val_op_map.apply(lambda x: x.astype(str).str.upper())
val_op_map = val_op_map.drop_duplicates(subset=['sourceName'])
val_op_id_map = val_op_map.set_index(['sourceName'])['Id']
test2_map = pd.read_csv('test2_map.csv')
test2_map = test2_map.apply(lambda x: x.astype(str).str.upper())
test2_map = test2_map.drop_duplicates(subset=['sourceName'])
test2_id_map = test2_map.set_index(['sourceName'])['Id']
visit_map = pd.read_excel('visit_map.xlsx')
visit_map = visit_map.apply(lambda x: x.astype(str).str.upper())
visit_map = visit_map.drop_duplicates(subset=['sourceName'])
visit_Id_map = visit_map.set_index(['sourceName'])['Id']
demo_map = pd.read_excel('demo_map.xlsx')
demo_map = demo_map.apply(lambda x: x.astype(str).str.upper())
demo_map = demo_map.drop_duplicates(subset=['sourceName'])
demo_id_map = demo_map.set_index(['sourceName'])['Id']
I feel repeating the same line of code multiple times for different files is neither efficient nor elegant. Is there any other better approach to write this?
I finally would like to have 5 series objects as output which are test_Id_map
, demo_id_map
,visit_id_map
,test2_id_map
, val_op_id_map
assuming you want a list of processed dataframes along with their id maps, you could do this :
filenames = ['test_map.csv','op_val_map.csv', 'test2_map.csv','visit_map.xlsx','demo_map.xlsx']
def preprocessing(fname):
df= pd.read_csv(fname)
df= df.apply(lambda x: x.astype(str).str.upper())
df= df.drop_duplicates(subset=['sourceName'])
df_Id_map = df.set_index(['sourceName'])['Id']
return df, df_Id_map
processed = [prerprocessing(fname) for fname in filenames]
Edit: if you want to handle different file types for input you could add it as if condition to your preprocessing function, for example:
def preprocessing(fname):
if fname.endswith('.csv'):
df= pd.read_csv(fname)
df= df.apply(lambda x: x.astype(str).str.upper())
df= df.drop_duplicates(subset=['sourceName'])
df_Id_map = df.set_index(['sourceName'])['Id']
elif fname.endswith('.xlsx'):
#do your things
return df, df_Id_map