I wrote some code to combine multiple CSV's that are interpretered with Pandas and appended to one combined CSV.
The issue I have is that the CSV files are delivered by multiple parties (monthly) and often contain differences with regard to column names, while they essentially contain the same information. For instance:
CSV 1 | ID | Instance number | | -------- | -------------- | | 1 | 401421 | | 2 | 420138 |
CSV 2 | ID | Instance NO | | -------- | -------------- | | 1 | 482012 | | 2 | 465921 |
This will result in two columns in the combined file, Instance Number & Instance NO unless I rename the column beforehand while the idea is to automatically process all files without intervention beforehand.
The solution that should work is to use combine_first or fillna, but next time the column may be entered as e.g. Instance No/number.
Since improving data delivery isn't an option, is there any smart way to solve issues like this without having to write out all possible variations and remap them to one leading column?
Thanks in advance!
I think first you need to have a dictionary of all possible names or you can quickly add those whenever you get a new one and rename the column names. for example
general_dict = { 'SLNO': ['Sl No', 'SNo']}
col_list = all_df.columns.to_list()
rename_dict = {}
for col in col_list:
for key, val in general_dict.items():
if col in val:
rename_dict[col] = key
break
all_df.rename(columns=rename_dict, inplace=True)