
Elegant way to read multiple csv files and convert them to series objects


I have 5 files to read (three CSV and two Excel), but the column names and the operations I perform on them are the same.

For example, in my code I do the 4 steps below (repeated for all 5 files):

a) read the file into a dataframe

b) convert the dataframe contents to upper case

c) drop duplicates on the sourceName column

d) create a series object (Id indexed by sourceName)

test_map = pd.read_csv('test_map.csv')
test_map = test_map.apply(lambda x: x.astype(str).str.upper())
test_map = test_map.drop_duplicates(subset=['sourceName'])
test_Id_map = test_map.set_index(['sourceName'])['Id']

val_op_map = pd.read_csv('op_val_map.csv')
val_op_map = val_op_map.apply(lambda x: x.astype(str).str.upper())
val_op_map = val_op_map.drop_duplicates(subset=['sourceName'])
val_op_id_map = val_op_map.set_index(['sourceName'])['Id']

test2_map = pd.read_csv('test2_map.csv')
test2_map = test2_map.apply(lambda x: x.astype(str).str.upper())
test2_map = test2_map.drop_duplicates(subset=['sourceName'])
test2_id_map = test2_map.set_index(['sourceName'])['Id']

visit_map = pd.read_excel('visit_map.xlsx')
visit_map = visit_map.apply(lambda x: x.astype(str).str.upper())
visit_map = visit_map.drop_duplicates(subset=['sourceName'])
visit_Id_map = visit_map.set_index(['sourceName'])['Id']

demo_map = pd.read_excel('demo_map.xlsx')
demo_map = demo_map.apply(lambda x: x.astype(str).str.upper())
demo_map = demo_map.drop_duplicates(subset=['sourceName'])
demo_id_map = demo_map.set_index(['sourceName'])['Id']

I feel that repeating the same lines of code for each file is neither efficient nor elegant. Is there a better approach?

In the end I would like to have 5 series objects as output: test_Id_map, val_op_id_map, test2_id_map, visit_Id_map and demo_id_map.


Solution

  • Assuming you want a list of processed dataframes along with their id maps, you could do this:

    import pandas as pd

    filenames = ['test_map.csv', 'op_val_map.csv', 'test2_map.csv', 'visit_map.xlsx', 'demo_map.xlsx']

    def preprocessing(fname):
      df = pd.read_csv(fname)  # note: the two .xlsx files need pd.read_excel instead (see the edit below)
      df = df.apply(lambda x: x.astype(str).str.upper())
      df = df.drop_duplicates(subset=['sourceName'])
      df_Id_map = df.set_index('sourceName')['Id']
      return df, df_Id_map

    processed = [preprocessing(fname) for fname in filenames]
    
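    To see the pattern in action without the real files, here is a self-contained sketch; the sample data and the `StringIO` buffer are stand-ins for one of your CSVs:

    ```python
    import pandas as pd
    from io import StringIO

    # Hypothetical sample data standing in for one of the real CSV files.
    csv_text = """sourceName,Id
    alpha,1
    beta,2
    alpha,3
    """

    def preprocessing(buf):
        # Same steps as above: read, upper-case everything, dedupe, index by sourceName.
        df = pd.read_csv(buf)
        df = df.apply(lambda x: x.astype(str).str.upper())
        df = df.drop_duplicates(subset=['sourceName'])
        id_map = df.set_index('sourceName')['Id']
        return df, id_map

    df, id_map = preprocessing(StringIO(csv_text))
    print(id_map['ALPHA'])  # first ALPHA row is kept, so this prints 1 (as a string)
    ```

    Each call returns a `(dataframe, id_map)` pair, so `processed` is a list of tuples you can unpack, e.g. `test_map, test_Id_map = processed[0]`.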

    Edit: if you want to handle different file types, you can add an if condition to the preprocessing function, for example:

    def preprocessing(fname):
      if fname.endswith('.csv'):
        df = pd.read_csv(fname)
      elif fname.endswith('.xlsx'):
        df = pd.read_excel(fname)
      df = df.apply(lambda x: x.astype(str).str.upper())
      df = df.drop_duplicates(subset=['sourceName'])
      df_Id_map = df.set_index('sourceName')['Id']
      return df, df_Id_map
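
    Building on that idea, a runnable sketch that dispatches on the file extension; the helper name `load_id_map`, the `READERS` table, and the sample data are all hypothetical, and the demo only exercises the CSV branch:

    ```python
    import tempfile
    from pathlib import Path

    import pandas as pd

    # Pick the pandas reader from the file extension.
    READERS = {'.csv': pd.read_csv, '.xlsx': pd.read_excel}

    def load_id_map(fname):
        reader = READERS[Path(fname).suffix]
        df = reader(fname)
        df = df.apply(lambda x: x.astype(str).str.upper())
        df = df.drop_duplicates(subset=['sourceName'])
        return df.set_index('sourceName')['Id']

    # Demo with a temporary CSV file (hypothetical data).
    with tempfile.TemporaryDirectory() as d:
        path = Path(d, 'test_map.csv')
        path.write_text('sourceName,Id\na,10\nb,20\n')
        test_Id_map = load_id_map(str(path))

    print(test_Id_map['A'])  # prints 10 (as a string, since all columns were cast to str)
    ```

    From here, the five named series from the question are just five calls, e.g. `test_Id_map = load_id_map('test_map.csv')` and `visit_Id_map = load_id_map('visit_map.xlsx')`.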