python pandas dataframe median

Finding the value closest to the group median in duplicated rows of a DataFrame


I have a DataFrame which contains more than 2000 rows.

Here is a part of my DataFrame:

In [2]: df
Out[2]: 
     A    B     C     D
0    a    b    -1   3.5
1    a    b    -1    52
2    a    b    -1     2
3    a    b    -1     0
4    a    b     0    15
5    a    c    -1  1612
6    a    c     1    17
7    a    e     1    52
8    a    d    -1   412
9    a    d    -1   532

I would like to group by A, B and C, find the index of the row whose D value is closest to (next after) the group's median of D, and add a new column, Next_Med, that labels that row.

Here is the expected result:

     A    B     C     D   Next_Med
0    a    b    -1   3.5          1
1    a    b    -1    52          0
2    a    b    -1     2          0
3    a    b    -1     0          0
4    a    b     0    15          1
5    a    c    -1  1612          1
6    a    c     1    17          1
7    a    e     1    52          1
8    a    d    -1   412          0
9    a    d    -1   532          1

For example, for the (a, b, -1) combination the median of D is 2.75, so I'd like to label the row with 3.5 as Next_Med.
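
The arithmetic behind that example can be checked with a short sketch. It rebuilds the sample rows from the table above (the construction and the names `group_median` and `dist` are illustrative, only the values come from the question) and shows that in the (a, b, -1) group both 2 and 3.5 are equally far from the median, which is why the tie has to be broken toward the next higher value.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "A": ["a"] * 10,
    "B": ["b", "b", "b", "b", "b", "c", "c", "e", "d", "d"],
    "C": [-1, -1, -1, -1, 0, -1, 1, 1, -1, -1],
    "D": [3.5, 52, 2, 0, 15, 1612, 17, 52, 412, 532],
})

# Median of D within each (A, B, C) group.
group_median = df.groupby(["A", "B", "C"])["D"].median()
print(group_median)

# Distance of rows 0..3, the (a, b, -1) group, from that group's median.
dist = (df.loc[:3, "D"] - group_median.loc[("a", "b", -1)]).abs()
print(dist)
```

Any group with an even number of rows behaves like this, because the median is then the midpoint of the two middle values.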


Solution

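  • Try the following one-liner, which uses groupby and transform with a lambda. transform returns a result aligned with the original index, so no sorting is needed; when two values are equally close to the median, the key picks the larger one, which matches the expected output:

    >>> df['Next_Med'] = df.groupby([*'ABC'])['D'].transform(lambda x: x == min(x, key=lambda y: (abs(y - x.median()), -y))).astype(int)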
    >>> df
       A  B  C       D  Next_Med
    0  a  b -1     3.5         1
    1  a  b -1    52.0         0
    2  a  b -1     2.0         0
    3  a  b -1     0.0         0
    4  a  b  0    15.0         1
    5  a  c -1  1612.0         1
    6  a  c  1    17.0         1
    7  a  e  1    52.0         1
    8  a  d -1   412.0         0
    9  a  d -1   532.0         1
    >>>
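
If the row index itself is needed, as the question asks, a possible alternative avoids the Python-level lambda. This is a minimal sketch rather than the answer's method: the names `keys`, `dist`, `order` and `next_med_idx` are illustrative, and it assumes ties should go to the larger D value, as in the expected result.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "A": ["a"] * 10,
    "B": ["b", "b", "b", "b", "b", "c", "c", "e", "d", "d"],
    "C": [-1, -1, -1, -1, 0, -1, 1, 1, -1, -1],
    "D": [3.5, 52, 2, 0, 15, 1612, 17, 52, 412, 532],
})

keys = ["A", "B", "C"]

# Absolute distance of each row's D from its group's median.
dist = (df["D"] - df.groupby(keys)["D"].transform("median")).abs()

# Sort by distance (ascending), then by D (descending) so that ties
# are resolved toward the larger value.
order = df.assign(dist=dist).sort_values(["dist", "D"], ascending=[True, False])

# The first row seen for each key combination is the one closest to the median.
next_med_idx = order.drop_duplicates(subset=keys).index

# Label exactly one row per group.
df["Next_Med"] = df.index.isin(next_med_idx).astype(int)
print(sorted(next_med_idx))
print(df)
```

Unlike the one-liner, this labels exactly one row per group even when the closest value occurs more than once in that group.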