I have the following issue with python pandas (I am relatively new to it): I have a simple dataset with a column for date, and a corresponding column of values. I am able to sort this Dataframe by date and value by doing the following:
df = df.sort_values(['date', 'value'],ascending=False)
I obtain this:
date value
2019-11 100
2019-11 89
2019-11 87
2019-11 86
2019_11 45
2019_11 33
2019_11 24
2019_11 11
2019_11 8
2019_11 5
2019-10 100
2019-10 98
2019-10 96
2019-10 94
2019_10 94
2019_10 78
2019_10 74
2019_10 12
2019_10 3
2019_10 1
Now, what I would like to do, is to get rid of the lowest fifth percentile for the value column for EACH month (each group). I know that I should use a groupby method, and perhaps also a function:
df = df.sort_values(['date', 'value'],ascending=False).groupby('date', group_keys=False).apply(<???>)
The ??? is where I am struggling. I know how to suppress the lowest 5th percentile on a sorted Dataframe as a WHOLE, for instance by doing:
df = df[df.value > df.value.quantile(.05)]
This was the object of another post on StackOverflow. I know that I can also use numpy to do this, and that it is much faster, but my issue is really how to apply that to EACH GROUP independently (each portion of the value column sorted by month) in the Dataframe, not just the whole Dataframe.
Any help would be greatly appreciated Thank you so very much, Kind regards, Berti
Use GroupBy.transform
with lambda function for Series with same size like original DataFrame
, so possible filter by boolean indexing
:
df = df.sort_values(['date', 'value'],ascending=False)
q = df.groupby('date')['value'].transform(lambda x: x.quantile(.05))
df = df[df.value > q]
print (df)
date value
4 2019_11 45
5 2019_11 33
6 2019_11 24
7 2019_11 11
8 2019_11 8
14 2019_10 94
15 2019_10 78
16 2019_10 74
17 2019_10 12
18 2019_10 3
0 2019-11 100
1 2019-11 89
2 2019-11 87
10 2019-10 100
11 2019-10 98
12 2019-10 96