Search code examples
pythonpandasgroup-by

For each value, find the smallest distance within the group using Pandas


suppose I have the following data frame df

group value
a       1
a       0
a       3
b       1
b       4

For each value, I want to find the closest value in the same group ('a' or 'b'),and then output the difference between the value and the closest value to it.

The output should look like this:

group value min_diff
a       1    1
a       0    1
a       3    2
b       1    3
b       4    3

`

Thank you!


Solution

  • You can use a self-merge_asof:

    tmp = df.sort_values('value')
    
    out = (pd.merge_asof(
               tmp.reset_index(),
               tmp.rename(columns={'value': 'value_'}),
               by='group',
               left_on='value', right_on='value_',
               direction='nearest',
               allow_exact_matches=False)
             .assign(min_diff=lambda d: d['value'].sub(d.pop('value_')).abs())
             .set_index('index').reindex(df.index)
           )
    

    If you can have duplicated values within a group (see example below) and would like to report 0 in this case, you further need to process the output:

    out.loc[df.duplicated(subset=['group', 'value'], keep=False), 'min_diff'] = 0
    

    Output:

      group  value  min_diff
    0     a      1         1
    1     a      0         1
    2     a      3         2
    3     b      1         3
    4     b      4         3
    

    Intermediate after the merge_asof:

       index group  value  value_
    0      1     a      0       1
    1      0     a      1       0
    2      3     b      1       4
    3      2     a      3       1
    4      4     b      4       1
    

    handling duplicated values

    Because of allow_exact_matches=False, matching the same value is prevented.

    Without correction:

      group  value  min_diff
    0     a      1         1
    1     a      0         1
    2     a      3         2
    3     b      1         3
    4     b      4         3
    5     c      8         1
    6     c      9         1  # matched with 8
    7     c      9         1  #
    

    After correction with out.loc[df.duplicated(subset=['group', 'value'], keep=False), 'min_diff'] = 0:

      group  value  min_diff
    0     a      1         1
    1     a      0         1
    2     a      3         2
    3     b      1         3
    4     b      4         3
    5     c      8         1
    6     c      9         0  # matched with each other
    7     c      9         0  #