This is another IQR outlier question. I have a dataframe that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df
I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc). Not the whole row, just the cell. And would like to do it for each of the 'red', 'yellow', 'green' columns.
Is there way to do this without breaking the dataframe into a whole bunch of sub dataframes with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.
I've tried a few things with .iloc and .loc but I can't seem to make it work.
If need replace outliers by missing values use GroupBy.transform
with DataFrame.quantile
, then compare for lower and greater values by DataFrame.lt
and DataFrame.gt
, chain masks by |
for bitwise OR
and set missing values in DataFrame.mask
, default replacement, so not specified:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
g = df.groupby(['Season','Treatment'])
df1 = g.transform('quantile', 0.05)
df2 = g.transform('quantile', 0.95)
c = df.columns.difference(['Season','Treatment'])
mask = df[c].lt(df1) | df[c].gt(df2)
df[c] = df[c].mask(mask)
print (df)
Season Treatment red yellow green
0 Spring Placebo NaN NaN 67.0
1 Spring Placebo 67.0 91.0 3.0
2 Spring Placebo 71.0 56.0 29.0
3 Spring Placebo 48.0 32.0 24.0
4 Spring Placebo 74.0 9.0 51.0
.. ... ... ... ... ...
95 Fall Drug 90.0 35.0 55.0
96 Fall Drug 40.0 55.0 90.0
97 Fall Drug NaN 54.0 NaN
98 Fall Drug 28.0 50.0 74.0
99 Fall Drug NaN 73.0 11.0
[100 rows x 5 columns]