Search code examples
pythondataframeiterationlist-comprehensionvectorization

Selectively Combine several dataframes based on conditions in each dataframe


I have three different time based dataframes with 10s of thousands of data points.

df1['time']  = 1, 2, 3, 4, 5
df1['data1'] = 1, 0, 0, 1, 0

df2['time']  = 1, 3, 5, 7, 9 
df2['data2'] = a, b, c, d, e

df3['time']  = 3, 4, 5, 6, 7
df3['data3'] = z, y, x, w, v

I want to combine these dataframes into 1 dataframe only where they have the same time point existing. In the above dataframes only time 3 and 5 exist concurrently across all three so merge those data points into the final dataframe.

df4['time']  = 3, 5
df4['data1'] = 0, 0
df4['data2'] = b, c
df4['data3'] = z, x

I've been trying to avoid iterating over the dataframes with if statements because of the numerous data points and the answer in How to iterate over rows in a DataFrame in Pandas from cs95 basically saying to avoid iterating if possible.

Am I stuck iterating through the dataframes or is there vectorization/list comprehension method I can follow?


Solution

  • You want to use pd.merge. I think of it like a SQL join, it works very similarly. In the example below I'm doing 2 merges (a.k.a. joins), first with df1 to df2 as an inner join on the time column, then that merged dataframe is merged with df3 using inner join on time again.

    import pandas as pd
    
    df1 = pd.DataFrame({'time': [1,2,3,4,5], 'data1': [1,0,0,1,0]})
    df2 = pd.DataFrame({'time': [1,3,5,7,9], 'data2': ['a','b','c','d','e']})
    df3 = pd.DataFrame({'time': [3,4,5,6,7], 'data3': ['z','y','x','w','v']})
    df4 = df1.merge(df2, how='inner', on='time').merge(df3, how='inner', on='time')
    df4
    Out[211]: 
       time  data1 data2 data3
    0     3      0     b     z
    1     5      0     c     x
    

    inner join only keeps values that exist in both dataframes. the on field indicates which column(s) need to match to join.

    Notice that the final result will include data1, data2, and data3 from each of the columns, if you have columns that are the same name in the datasets and are not included in on then they will be renamed with _df1, _df2, _df3 respectively.