Search code examples
pythonpandascsvmerging-data

How to merge multiple raw input CSV's with pandas containing similar columns with slightly different names?


I wrote some code to combine multiple CSV's that are interpretered with Pandas and appended to one combined CSV.

The issue I have is that the CSV files are delivered by multiple parties (monthly) and often contain differences with regard to column names, while they essentially contain the same information. For instance:

CSV 1 | ID | Instance number | | -------- | -------------- | | 1 | 401421 | | 2 | 420138 |

CSV 2 | ID | Instance NO | | -------- | -------------- | | 1 | 482012 | | 2 | 465921 |

This will result in two columns in the combined file, Instance Number & Instance NO unless I rename the column beforehand while the idea is to automatically process all files without intervention beforehand.

The solution that should work is to use combine_first or fillna, but next time the column may be entered as e.g. Instance No/number.

Since improving data delivery isn't an option, is there any smart way to solve issues like this without having to write out all possible variations and remap them to one leading column?

Thanks in advance!


Solution

  • I think first you need to have a dictionary of all possible names or you can quickly add those whenever you get a new one and rename the column names. for example

    general_dict = { 'SLNO': ['Sl No', 'SNo']}
    
    col_list = all_df.columns.to_list()
    rename_dict = {}
    
    for col in col_list:
        for key, val in general_dict.items():
            if col in val:
                rename_dict[col] = key
    
                break
    all_df.rename(columns=rename_dict, inplace=True)