suppose I have the following data frame df
group value
a 1
a 0
a 3
b 1
b 4
For each value, I want to find the closest value in the same group ('a' or 'b'),and then output the difference between the value and the closest value to it.
The output should look like this:
group value min_diff
a 1 1
a 0 1
a 3 2
b 1 3
b 4 3
`
Thank you!
You can use a self-merge_asof
:
tmp = df.sort_values('value')
out = (pd.merge_asof(
tmp.reset_index(),
tmp.rename(columns={'value': 'value_'}),
by='group',
left_on='value', right_on='value_',
direction='nearest',
allow_exact_matches=False)
.assign(min_diff=lambda d: d['value'].sub(d.pop('value_')).abs())
.set_index('index').reindex(df.index)
)
If you can have duplicated values within a group (see example below) and would like to report 0
in this case, you further need to process the output:
out.loc[df.duplicated(subset=['group', 'value'], keep=False), 'min_diff'] = 0
Output:
group value min_diff
0 a 1 1
1 a 0 1
2 a 3 2
3 b 1 3
4 b 4 3
Intermediate after the merge_asof
:
index group value value_
0 1 a 0 1
1 0 a 1 0
2 3 b 1 4
3 2 a 3 1
4 4 b 4 1
Because of allow_exact_matches=False
, matching the same value is prevented.
Without correction:
group value min_diff
0 a 1 1
1 a 0 1
2 a 3 2
3 b 1 3
4 b 4 3
5 c 8 1
6 c 9 1 # matched with 8
7 c 9 1 #
After correction with out.loc[df.duplicated(subset=['group', 'value'], keep=False), 'min_diff'] = 0
:
group value min_diff
0 a 1 1
1 a 0 1
2 a 3 2
3 b 1 3
4 b 4 3
5 c 8 1
6 c 9 0 # matched with each other
7 c 9 0 #