python pandas dataframe merge data-analysis

How to merge multiple dataframes


I have several dataframes and need to merge them on the date column. With two dataframes I can use df1.merge(df2, on='date'). With three I use df1.merge(df2.merge(df3, on='date'), on='date'), but this becomes complex and unreadable as the number of dataframes grows.

All dataframes have one column in common, date, but they don't have the same number of rows or columns, and I only need the rows whose date appears in every dataframe.

So I'm trying to write a recursive function that returns a dataframe with all the data, but it doesn't work. How should I merge multiple dataframes?

I tried different approaches and got errors such as out of range, KeyError 0/1/2/3, and can not merge DataFrame with instance of type <class 'NoneType'>.

This is the script I wrote:

dfs = [df1, df2, df3] # list of dataframes

def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2): # it gets to the second to last and merges it with the last
        return
    
    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm

print(mergefiles(dfs, len(dfs)))

An example: df_1:

May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%

df_2:

May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%

df_3:

May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%

Expected merge result:

May 15, 2017;  1,901.00;0.1%;  2,902.00;1000000;0.2%;   3,903.00;2000000;0.3%   

Solution

  • Short answer

    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                    how='outer'), data_frames)
    

    Long answer

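    Below is the cleanest, most comprehensible way to merge multiple dataframes when no complex queries are involved.

    Simply merge on the DATE column (replace DATE with the name of your key column, date in the question) using how='outer', which keeps every date from every dataframe. To keep only the dates common to all dataframes, as the question asks, use how='inner' instead.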

    import pandas as pd
    from functools import reduce
    
    df1 = pd.read_csv('file1.csv', sep=',')
    df2 = pd.read_csv('file2.csv', sep=',')
    df3 = pd.read_csv('file3.csv', sep=',')
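The sample rows in the question are semicolon-separated and use a comma as the thousands separator, so a plain comma-separated read would not parse them. A minimal sketch of reading such data, assuming there is no header row; the column names `date`, `price_1` and `pct_1` are hypothetical, since the question does not name its columns:

```python
import io
import pandas as pd

# Rows copied from df_1 in the question. io.StringIO stands in for a file path.
raw = """May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%
"""

df1 = pd.read_csv(
    io.StringIO(raw),
    sep=";",                             # fields are separated by semicolons
    header=None,                         # assumption: the file has no header row
    names=["date", "price_1", "pct_1"],  # hypothetical column names
    thousands=",",                       # parse "1,200.00" as 1200.0
)

# Convert the date strings to real timestamps so that merges compare dates,
# not text.
df1["date"] = pd.to_datetime(df1["date"])
```

The percentage column is left as text here; converting it to a number would be a separate step.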
    

    Next, put all the dataframes you have loaded into a list, then merge them with functools.reduce and pd.merge.

    # compile the list of dataframes you want to merge
    data_frames = [df1, df2, df3]
    

    Note: you can add as many dataframes to the list as you like. This is the advantage of this method: no complex queries are involved.

    To keep the values that belong to the same date on the same row, merge on DATE:

    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                    how='outer'), data_frames)
    
    # To fill the values that are missing from the merged dataframe,
    # chain fillna() with the placeholder you want:
    
    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                    how='outer'), data_frames).fillna('void')
    
    • The output now has the values from the same date on the same row.
    • You can fill the data that a dataframe is missing for a given date using fillna().
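The question asks only for dates that appear in every dataframe, which corresponds to `how='inner'` rather than `how='outer'`. A minimal sketch comparing the two on the question's sample values; the column names (`price_1`, `volume_2`, and so on) are hypothetical, and the percentage columns are omitted for brevity:

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({
    "date": ["May 19, 2017", "May 18, 2017", "May 17, 2017", "May 15, 2017"],
    "price_1": [1200.0, 1100.0, 1000.0, 1901.0],
})
df2 = pd.DataFrame({
    "date": ["May 20, 2017", "May 18, 2017", "May 16, 2017", "May 15, 2017"],
    "price_2": [2200.0, 2100.0, 2000.0, 2902.0],
    "volume_2": [1000000, 1590000, 1230000, 1000000],
})
df3 = pd.DataFrame({
    "date": ["May 21, 2017", "May 17, 2017", "May 16, 2017", "May 15, 2017"],
    "price_3": [3200.0, 3100.0, 3000.0, 3903.0],
    "volume_3": [2000000, 2590000, 2230000, 2000000],
})
data_frames = [df1, df2, df3]

# Inner merge: keep only the dates present in every dataframe.
df_inner = reduce(
    lambda left, right: pd.merge(left, right, on="date", how="inner"),
    data_frames,
)

# Outer merge: keep every date, with NaN where a dataframe has no row for it.
df_outer = reduce(
    lambda left, right: pd.merge(left, right, on="date", how="outer"),
    data_frames,
)

print(df_inner)
```

With these inputs the inner merge keeps the single row for May 15, 2017, while the outer merge keeps all seven distinct dates. If the dataframes shared non-key column names, pd.merge would disambiguate them with suffixes, so giving each dataframe distinct column names up front keeps the result readable.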

    Then write the merged data to a CSV file if desired.

    df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)
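The `na_rep` argument only affects values that are still missing when the file is written, so it has no effect if `fillna()` has already replaced them. A small sketch of what it does, using a hypothetical two-row dataframe and an in-memory buffer in place of a file:

```python
import io
import pandas as pd

# Hypothetical merged dataframe with one missing value in VALUE1.
df_merged = pd.DataFrame({
    "DATE": ["2017-05-15", "2017-05-16"],
    "VALUE1": [1901.0, None],
})

buffer = io.StringIO()  # stands in for a path such as 'merged.txt'
df_merged.to_csv(buffer, sep=",", na_rep=".", index=False)

# Split into lines so the result is easy to inspect.
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The missing value is written as `.` in the second data row.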
    

    This should give you

    DATE VALUE1 VALUE2 VALUE3 ....
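For comparison, the recursive function in the question fails for two reasons: its base case returns None, and the recursive call passes a single dataframe (dfs[i+1]) where the whole list is expected. A minimal corrected sketch, offered as an illustration rather than as the recommended approach; the name `merge_all` is hypothetical, and it assumes a non-empty list in which every dataframe has a `date` column:

```python
import pandas as pd

def merge_all(dfs, i=0):
    """Recursively inner-merge dfs[i:] on the 'date' column."""
    # Base case: the last dataframe is returned as-is instead of None.
    if i == len(dfs) - 1:
        return dfs[i]
    # Pass the whole list and advance the index, not a single dataframe.
    return dfs[i].merge(merge_all(dfs, i + 1), on="date")

# Tiny hypothetical inputs: only date 3 appears in all three dataframes.
a = pd.DataFrame({"date": [1, 2, 3], "x": [10, 20, 30]})
b = pd.DataFrame({"date": [2, 3, 4], "y": [200, 300, 400]})
c = pd.DataFrame({"date": [3, 4, 5], "z": [3000, 4000, 5000]})

result = merge_all([a, b, c])
print(result)
```

The reduce version does the same work without recursion and avoids hitting Python's recursion limit on very long lists.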