Search code examples
pythonpandasjoin

Merge two pandas dataframe based on conditional


The objective is to combine two df row wise, if a predetermine condition is met. Specifically, if the difference between the column is less or equal than a threshold, then, join the row of the df.

Given two df: df1 and df2, the following code partially achieve the objective.

import pandas as pd

df1 = pd.DataFrame ( {'time': [2, 3, 4, 24, 31]} )
df2 = pd.DataFrame (  {'time': [4.1, 24.7, 31.4, 5]} )
th = 0.9
all_comb=[]
for index, row in df1.iterrows ():
    for index2, row2 in df2.iterrows ():
        diff = abs ( row ['time'] - row2 ['time'] )
        if diff <= th:
            all_comb.append({'idx_1':index,'time_1':row ['time'], 'idx_2':index2,'time_2':row2 ['time']})
df_all = pd.DataFrame(all_comb)

outputted

       idx_1  time_1  idx_2  time_2
0      2       4      0     4.1
1      3      24      1    24.7
2      4      31      2    31.4

However, the above approach ignore certain information i.e., the value of 2 and 3 from the df1, and the value of 5 from df2.

The expected output should be something like

idx_1  time_1  idx_2  time_2

0      2       NA    NA
1      3       NA    NA    
2       4      0     4.1
3      24      1    24.7
4      31      2    31.4
NA     NA      3     5

Appreciate for any hint or any way that more compact and efficient than the proposed above.


Solution

  • You can perform a cross merge and then subset all the rows at once based on your condition. Then we concat, adding back any rows that had no conditions met from both DataFrames.

    import pandas as pd
    
    df1 = df1.reset_index().add_suffix('_1')
    df2 = df2.reset_index().add_suffix('_2')
    
    m = df1.merge(df2, how='cross')
    
    # Subset to all matches: |time_diff| <= thresh
    th = 0.9
    m = m[(m['time_1'] - m['time_2']).abs().le(th)]
    
    # Add back rows with no matches
    res = pd.concat([df1[~df1.index_1.isin(m.index_1)],
                     m,
                     df2[~df2.index_2.isin(m.index_2)]], ignore_index=True)
    

    print(res)
       index_1  time_1  index_2  time_2
    0      0.0     2.0      NaN     NaN
    1      1.0     3.0      NaN     NaN
    2      2.0     4.0      0.0     4.1
    3      3.0    24.0      1.0    24.7
    4      4.0    31.0      2.0    31.4
    5      NaN     NaN      3.0     5.0