
Elegant way to read multiple csv files and convert them to series objects


I have 5 files to read (three CSV and two Excel), but the column names and the operations I perform on them are the same.

For example, in my code I do the 4 steps below (repeated for all 5 files):

a) read the file into a dataframe

b) convert the dataframe contents to upper case

c) drop duplicates on the sourceName column

d) create a series object (Id indexed by sourceName)

test_map = pd.read_csv('test_map.csv')
test_map = test_map.apply(lambda x: x.astype(str).str.upper())
test_map = test_map.drop_duplicates(subset=['sourceName'])
test_Id_map = test_map.set_index(['sourceName'])['Id']

val_op_map = pd.read_csv('op_val_map.csv')
val_op_map = val_op_map.apply(lambda x: x.astype(str).str.upper())
val_op_map = val_op_map.drop_duplicates(subset=['sourceName'])
val_op_id_map = val_op_map.set_index(['sourceName'])['Id']

test2_map = pd.read_csv('test2_map.csv')
test2_map = test2_map.apply(lambda x: x.astype(str).str.upper())
test2_map = test2_map.drop_duplicates(subset=['sourceName'])
test2_id_map = test2_map.set_index(['sourceName'])['Id']

visit_map = pd.read_excel('visit_map.xlsx')
visit_map = visit_map.apply(lambda x: x.astype(str).str.upper())
visit_map = visit_map.drop_duplicates(subset=['sourceName'])
visit_Id_map = visit_map.set_index(['sourceName'])['Id']

demo_map = pd.read_excel('demo_map.xlsx')
demo_map = demo_map.apply(lambda x: x.astype(str).str.upper())
demo_map = demo_map.drop_duplicates(subset=['sourceName'])
demo_id_map = demo_map.set_index(['sourceName'])['Id']

I feel that repeating the same lines of code for each file is neither efficient nor elegant. Is there a better approach?

In the end I would like to have 5 series objects as output: test_Id_map, val_op_id_map, test2_id_map, visit_Id_map and demo_id_map.


Solution

  • Assuming you want a list of processed dataframes along with their id maps, you could do this:

    import pandas as pd

    filenames = ['test_map.csv', 'op_val_map.csv', 'test2_map.csv', 'visit_map.xlsx', 'demo_map.xlsx']

    def preprocessing(fname):
      df = pd.read_csv(fname)  # note: the two .xlsx files need pd.read_excel instead (see the edit below)
      df = df.apply(lambda x: x.astype(str).str.upper())
      df = df.drop_duplicates(subset=['sourceName'])
      df_Id_map = df.set_index('sourceName')['Id']
      return df, df_Id_map

    processed = [preprocessing(fname) for fname in filenames]
    
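    To see the pattern in action without the real files, here is a self-contained sketch; the sample data and the `StringIO` buffer are stand-ins for one of your CSVs:

    ```python
    import pandas as pd
    from io import StringIO

    # Hypothetical sample data standing in for one of the real CSV files.
    csv_text = """sourceName,Id
    alpha,1
    beta,2
    alpha,3
    """

    def preprocessing(buf):
        # Same steps as above: read, upper-case everything, dedupe, index by sourceName.
        df = pd.read_csv(buf)
        df = df.apply(lambda x: x.astype(str).str.upper())
        df = df.drop_duplicates(subset=['sourceName'])
        id_map = df.set_index('sourceName')['Id']
        return df, id_map

    df, id_map = preprocessing(StringIO(csv_text))
    print(id_map['ALPHA'])  # first ALPHA row is kept, so this prints 1 (as a string)
    ```

    Each call returns a `(dataframe, id_map)` pair, so `processed` is a list of tuples you can unpack, e.g. `test_map, test_Id_map = processed[0]`.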

    Edit: if you want to handle different file types, you can add an if condition to the preprocessing function, for example:

    def preprocessing(fname):
      if fname.endswith('.csv'):
        df = pd.read_csv(fname)
      elif fname.endswith('.xlsx'):
        df = pd.read_excel(fname)
      df = df.apply(lambda x: x.astype(str).str.upper())
      df = df.drop_duplicates(subset=['sourceName'])
      df_Id_map = df.set_index('sourceName')['Id']
      return df, df_Id_map
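
    Building on that idea, a runnable sketch that dispatches on the file extension; the helper name `load_id_map`, the `READERS` table, and the sample data are all hypothetical, and the demo only exercises the CSV branch:

    ```python
    import tempfile
    from pathlib import Path

    import pandas as pd

    # Pick the pandas reader from the file extension.
    READERS = {'.csv': pd.read_csv, '.xlsx': pd.read_excel}

    def load_id_map(fname):
        reader = READERS[Path(fname).suffix]
        df = reader(fname)
        df = df.apply(lambda x: x.astype(str).str.upper())
        df = df.drop_duplicates(subset=['sourceName'])
        return df.set_index('sourceName')['Id']

    # Demo with a temporary CSV file (hypothetical data).
    with tempfile.TemporaryDirectory() as d:
        path = Path(d, 'test_map.csv')
        path.write_text('sourceName,Id\na,10\nb,20\n')
        test_Id_map = load_id_map(str(path))

    print(test_Id_map['A'])  # prints 10 (as a string, since all columns were cast to str)
    ```

    From here, the five named series from the question are just five calls, e.g. `test_Id_map = load_id_map('test_map.csv')` and `visit_Id_map = load_id_map('visit_map.xlsx')`.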