I have three different time based dataframes with 10s of thousands of data points.
df1['time'] = 1, 2, 3, 4, 5
df1['data1'] = 1, 0, 0, 1, 0
df2['time'] = 1, 3, 5, 7, 9
df2['data2'] = a, b, c, d, e
df3['time'] = 3, 4, 5, 6, 7
df3['data3'] = z, y, x, w, v
I want to combine these dataframes into 1 dataframe only where they have the same time point existing. In the above dataframes only time 3 and 5 exist concurrently across all three so merge those data points into the final dataframe.
df4['time'] = 3, 5
df4['data1'] = 0, 0
df4['data2'] = b, c
df4['data3'] = z, x
I've been trying to avoid iterating over the dataframes with if statements because of the numerous data points and the answer in How to iterate over rows in a DataFrame in Pandas from cs95 basically saying to avoid iterating if possible.
Am I stuck iterating through the dataframes or is there vectorization/list comprehension method I can follow?
You want to use pd.merge
. I think of it like a SQL join, it works very similarly. In the example below I'm doing 2 merges (a.k.a. joins), first with df1
to df2
as an inner join on the time column, then that merged dataframe is merged with df3
using inner join on time again.
import pandas as pd
df1 = pd.DataFrame({'time': [1,2,3,4,5], 'data1': [1,0,0,1,0]})
df2 = pd.DataFrame({'time': [1,3,5,7,9], 'data2': ['a','b','c','d','e']})
df3 = pd.DataFrame({'time': [3,4,5,6,7], 'data3': ['z','y','x','w','v']})
df4 = df1.merge(df2, how='inner', on='time').merge(df3, how='inner', on='time')
df4
Out[211]:
time data1 data2 data3
0 3 0 b z
1 5 0 c x
inner
join only keeps values that exist in both dataframes. the on
field indicates which column(s) need to match to join.
Notice that the final result will include data1
, data2
, and data3
from each of the columns, if you have columns that are the same name in the datasets and are not included in on
then they will be renamed with _df1
, _df2
, _df3
respectively.