Search code examples
python-3.xpandasdataframepandas-groupbypercentile

Pandas: how to drop the lowest 5th percentile for each indexed group?


I have the following issue with python pandas (I am relatively new to it): I have a simple dataset with a column for date, and a corresponding column of values. I am able to sort this Dataframe by date and value by doing the following:

df = df.sort_values(['date', 'value'],ascending=False)

I obtain this:

date       value
2019-11    100
2019-11    89
2019-11    87
2019-11    86   
2019_11    45
2019_11    33
2019_11    24
2019_11    11
2019_11    8
2019_11    5
2019-10    100 
2019-10    98
2019-10    96
2019-10    94
2019_10    94
2019_10    78
2019_10    74
2019_10    12
2019_10    3
2019_10    1

Now, what I would like to do, is to get rid of the lowest fifth percentile for the value column for EACH month (each group). I know that I should use a groupby method, and perhaps also a function:

df = df.sort_values(['date', 'value'],ascending=False).groupby('date', group_keys=False).apply(<???>)

The ??? is where I am struggling. I know how to suppress the lowest 5th percentile on a sorted Dataframe as a WHOLE, for instance by doing:

df = df[df.value > df.value.quantile(.05)]

This was the object of another post on StackOverflow. I know that I can also use numpy to do this, and that it is much faster, but my issue is really how to apply that to EACH GROUP independently (each portion of the value column sorted by month) in the Dataframe, not just the whole Dataframe.

Any help would be greatly appreciated Thank you so very much, Kind regards, Berti


Solution

  • Use GroupBy.transform with lambda function for Series with same size like original DataFrame, so possible filter by boolean indexing:

    df = df.sort_values(['date', 'value'],ascending=False)
    
    q = df.groupby('date')['value'].transform(lambda x: x.quantile(.05))
    df = df[df.value > q]
    print (df)
           date  value
    4   2019_11     45
    5   2019_11     33
    6   2019_11     24
    7   2019_11     11
    8   2019_11      8
    14  2019_10     94
    15  2019_10     78
    16  2019_10     74
    17  2019_10     12
    18  2019_10      3
    0   2019-11    100
    1   2019-11     89
    2   2019-11     87
    10  2019-10    100
    11  2019-10     98
    12  2019-10     96