Suppose I have multiple dataframes:
print (df1)
datetime A
0 2012-08-14 07:00 1
1 2012-08-14 07:01 2
2 2012-08-14 08:15 3
... ...
192908 2013-08-14 16:00 600
192948 2013-08-14 16:15 700
192949 2013-08-14 16:57 900
print (df2)
datetime B
0 2012-08-14 07:00 100
1 2012-08-14 07:15 200
2 2012-08-14 07:30 300
... ...
12140 2013-09-24 15:45 50
12141 2013-09-24 16:00 60
12142 2013-09-24 16:15 70
How do I create a new df containing only the rows for which there is a value in columns A
and B
at the same datetime? I tried using the isin
function:
df1 = df1[df1['date'].isin(df2['date'])]
but this only does a one-way check, i.e. only values of A
for which there exists a value of B
at the same datetime
are retained, but if there are extra values in B
for datetimes that do not exist in A
then these are left in df2.
I can repeat the operation in the opposite direction to resolve this:
df2 = df2[df2['date'].isin(df1['date'])]
but for >2 dataframes (I have around fifty in my present work) this becomes extremely long and inefficient because it would be necessary to go through every possible paired combination between the full set of dataframes. For example, a third dataframe, df3, would first need to be checked against df1 and df2, but if it contained datetimes that exist neither in df1 nor df2 then df1 and df2 would in turn need to be re-checked back against df3.
Desired output is to have re-defined all of the dataframes such that they contain only values of A
, B
, etc. with a matching datetime value.
This is a join / merge operation. Standard Codd relational theory/algebra.
import io
df1 = pd.read_csv(io.StringIO(""" datetime A
0 2012-08-14 07:00 1
1 2012-08-14 07:01 2
2 2012-08-14 08:15 3
192908 2013-08-14 16:00 600
192948 2013-08-14 16:15 700
192949 2013-08-14 16:57 900"""), sep="\s\s+", engine="python")
df2 = pd.read_csv(io.StringIO(""" datetime B
0 2012-08-14 07:00 100
1 2012-08-14 07:15 200
2 2012-08-14 07:30 300
12140 2013-09-24 15:45 50
12141 2013-09-24 16:00 60
12142 2013-09-24 16:15 70"""), sep="\s\s+", engine="python")
pd.merge(df1,df2, on="datetime", how="inner")
datetime A B
0 2012-08-14 07:00 1 100
import io, random, functools
# generate a list of dataframes for merge... start with two sample ones
dfs = [df1, df2]
# generate longer list of dataframes, rename columns to add some interest for merge :-)
dfs = [dfs[random.randint(0, len(dfs)-1)].pipe(lambda d: d.rename(columns={d.columns[1]:f"{d.columns[1]}_{i}"})) for i in range(8)]
# and one line merge the whole list of dataframes
functools.reduce(lambda left,right: pd.merge(left,right,on='datetime'), dfs)
datetime | A_0 | B_1 | B_2 | B_3 | A_4 | B_5 | B_6 | A_7 | |
---|---|---|---|---|---|---|---|---|---|
0 | 2012-08-14 07:00 | 1 | 100 | 100 | 100 | 1 | 100 | 100 | 1 |