Search code examples
pythonpandasdataframefilterisin

How to retain rows from multiple dataframes with common column value?


Suppose I have multiple dataframes:

print (df1)
                datetime    A
0       2012-08-14 07:00    1
1       2012-08-14 07:01    2
2       2012-08-14 08:15    3
                     ...  ...
192908  2013-08-14 16:00  600
192948  2013-08-14 16:15  700
192949  2013-08-14 16:57  900

print (df2)
               datetime    B    
0      2012-08-14 07:00  100
1      2012-08-14 07:15  200
2      2012-08-14 07:30  300
                    ...  ...
12140  2013-09-24 15:45   50
12141  2013-09-24 16:00   60
12142  2013-09-24 16:15   70

How do I create a new df containing only the rows for which there is a value in columns A and B at the same datetime? I tried using the isin function:

df1 = df1[df1['date'].isin(df2['date'])]

but this only does a one-way check, i.e. only values of A for which there exists a value of B at the same datetime are retained, but if there are extra values in B for datetimes that do not exist in A then these are left in df2.

I can repeat the operation in the opposite direction to resolve this:

df2 = df2[df2['date'].isin(df1['date'])]

but for >2 dataframes (I have around fifty in my present work) this becomes extremely long and inefficient because it would be necessary to go through every possible paired combination between the full set of dataframes. For example, a third dataframe, df3, would first need to be checked against df1 and df2, but if it contained datetimes that exist neither in df1 nor df2 then df1 and df2 would in turn need to be re-checked back against df3.

Desired output is to have re-defined all of the dataframes such that they contain only values of A, B, etc. with a matching datetime value.


Solution

  • This is a join / merge operation. Standard Codd relational theory/algebra.

    import io
    df1 = pd.read_csv(io.StringIO("""                datetime    A
    0       2012-08-14 07:00    1
    1       2012-08-14 07:01    2
    2       2012-08-14 08:15    3
    192908  2013-08-14 16:00  600
    192948  2013-08-14 16:15  700
    192949  2013-08-14 16:57  900"""), sep="\s\s+", engine="python")
    
    df2 = pd.read_csv(io.StringIO("""               datetime    B    
    0      2012-08-14 07:00  100
    1      2012-08-14 07:15  200
    2      2012-08-14 07:30  300
    12140  2013-09-24 15:45   50
    12141  2013-09-24 16:00   60
    12142  2013-09-24 16:15   70"""), sep="\s\s+", engine="python")
    
    pd.merge(df1,df2, on="datetime", how="inner")
    
    

    output

        datetime    A   B
    0   2012-08-14 07:00    1   100
    
    

    want to merge many data frames

    import io, random, functools
    
    # generate a list of dataframes for merge... start with two sample ones
    dfs = [df1, df2]
    # generate longer list of dataframes, rename columns to add some interest for merge :-)
    dfs = [dfs[random.randint(0, len(dfs)-1)].pipe(lambda d: d.rename(columns={d.columns[1]:f"{d.columns[1]}_{i}"})) for i in range(8)]
    
    # and one line merge the whole list of dataframes
    functools.reduce(lambda left,right: pd.merge(left,right,on='datetime'), dfs)
    
    
    datetime A_0 B_1 B_2 B_3 A_4 B_5 B_6 A_7
    0 2012-08-14 07:00 1 100 100 100 1 100 100 1