Search code examples
pythonpandasdataframemergeleft-join

Merging pandas dataframes with 2 keys that are not the same


I have these 2 tables :

tab1         tab2
col1 col2    col1 col2  col3
A    2017    A    2017  foo
A    2018    A    2019  fii
A    2019    A    2020  fee
B    2017    B    2017  boo
B    2019    B    2020  bii
C    2017    C    2017  coo
C    2018    C    2018  cii

I want to merge using Python these two tables with both col1 and col2 as keys. But my problem is, for example, on the second row, on tab1, I have (A, 2018) but on tab2, I have (A, 2019 or 2017) so the row will be NaN in the merged table.

So, my question is how can I fill those row by the nearest corresponding year from tab2? Instead of being an NaN row, it will be, for example, filled with (A, 2019).

So the result would be something like this :

merged_tab
col1 col2  col3
A    2017  foo
A    2018  fii
A    2019  fii
B    2017  boo
B    2019  boo
C    2017  coo
C    2018  cii

Thank you!


Solution

  • This looks like a merge_asof.

    Here on the nearest previous value:

    out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                         tab2.sort_values(by='col2'),
                         on='col2', by='col1', direction='backward')
             .set_index('index').reindex(tab1.index)
          )
    

    Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  foo  # 2018 is absent, let's take 2017
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  boo
    5    C  2017  coo
    6    C  2018  cii
    

    One the nearest following value:

    out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                         tab2.sort_values(by='col2'),
                         on='col2', by='col1', direction='forward')
             .set_index('index').reindex(tab1.index)
          )
    

    Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  fii  # 2018 is absent, let's take 2019
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  bii
    5    C  2017  coo
    6    C  2018  cii
    

    Here on the nearest overall value:

    out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
                         tab2.sort_values(by='col2'),
                         on='col2', by='col1', direction='nearest')
             .set_index('index').reindex(tab1.index)
          )
    

    Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  foo  # 2017/2019 are equidistant, let's take 2017
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  bii  # 2020 is closer than 2017
    5    C  2017  coo
    6    C  2018  cii
    

    And if you want to merge on the nearest value, with the forward one in case of equality:

    out = (pd.merge_asof(tab1.reset_index().eval('col2=-col2').sort_values(by='col2'),
                         tab2.eval('col2=-col2').sort_values(by='col2'),
                         on='col2', by='col1', direction='nearest')
             .set_index('index').reindex(tab1.index)
             .eval('col2=-col2')
          )
    

    Output:

      col1  col2 col3
    0    A  2017  foo
    1    A  2018  fii  # 2017 and 2019 are equality distant, give 2019 priority 
    2    A  2019  fii
    3    B  2017  boo
    4    B  2019  bii  # 2020 is closer than 2017
    5    C  2017  coo
    6    C  2018  cii