I have these 2 tables :
tab1 tab2
col1 col2 col1 col2 col3
A 2017 A 2017 foo
A 2018 A 2019 fii
A 2019 A 2020 fee
B 2017 B 2017 boo
B 2019 B 2020 bii
C 2017 C 2017 coo
C 2018 C 2018 cii
I want to merge using Python these two tables with both col1
and col2
as keys. But my problem is, for example, on the second row, on tab1, I have (A, 2018) but on tab2, I have (A, 2019 or 2017) so the row will be NaN
in the merged table.
So, my question is how can I fill those row by the nearest corresponding year from tab2
? Instead of being an NaN
row, it will be, for example, filled with (A, 2019).
So the result would be something like this :
merged_tab
col1 col2 col3
A 2017 foo
A 2018 fii
A 2019 fii
B 2017 boo
B 2019 boo
C 2017 coo
C 2018 cii
Thank you!
This looks like a merge_asof
.
Here on the nearest previous value:
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
tab2.sort_values(by='col2'),
on='col2', by='col1', direction='backward')
.set_index('index').reindex(tab1.index)
)
Output:
col1 col2 col3
0 A 2017 foo
1 A 2018 foo # 2018 is absent, let's take 2017
2 A 2019 fii
3 B 2017 boo
4 B 2019 boo
5 C 2017 coo
6 C 2018 cii
One the nearest following value:
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
tab2.sort_values(by='col2'),
on='col2', by='col1', direction='forward')
.set_index('index').reindex(tab1.index)
)
Output:
col1 col2 col3
0 A 2017 foo
1 A 2018 fii # 2018 is absent, let's take 2019
2 A 2019 fii
3 B 2017 boo
4 B 2019 bii
5 C 2017 coo
6 C 2018 cii
Here on the nearest overall value:
out = (pd.merge_asof(tab1.reset_index().sort_values(by='col2'),
tab2.sort_values(by='col2'),
on='col2', by='col1', direction='nearest')
.set_index('index').reindex(tab1.index)
)
Output:
col1 col2 col3
0 A 2017 foo
1 A 2018 foo # 2017/2019 are equidistant, let's take 2017
2 A 2019 fii
3 B 2017 boo
4 B 2019 bii # 2020 is closer than 2017
5 C 2017 coo
6 C 2018 cii
And if you want to merge on the nearest value, with the forward one in case of equality:
out = (pd.merge_asof(tab1.reset_index().eval('col2=-col2').sort_values(by='col2'),
tab2.eval('col2=-col2').sort_values(by='col2'),
on='col2', by='col1', direction='nearest')
.set_index('index').reindex(tab1.index)
.eval('col2=-col2')
)
Output:
col1 col2 col3
0 A 2017 foo
1 A 2018 fii # 2017 and 2019 are equality distant, give 2019 priority
2 A 2019 fii
3 B 2017 boo
4 B 2019 bii # 2020 is closer than 2017
5 C 2017 coo
6 C 2018 cii