Search code examples
python-3.xpandasdataframemulti-indexoutliers

Removing outliers based on column variables or multi-index in a dataframe


This is another IQR outlier question. I have a dataframe that looks something like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
df.loc[0:49,'Season'] = 'Spring'
df.loc[50:99,'Season'] = 'Fall'
df.loc[0:24,'Treatment'] = 'Placebo'
df.loc[25:49,'Treatment'] = 'Drug'
df.loc[50:74,'Treatment'] = 'Placebo'
df.loc[75:99,'Treatment'] = 'Drug'
df = df[['Season','Treatment','red','yellow','green']]
df

I would like to find and remove the outliers for each condition (i.e. Spring Placebo, Spring Drug, etc). Not the whole row, just the cell. And would like to do it for each of the 'red', 'yellow', 'green' columns.

Is there way to do this without breaking the dataframe into a whole bunch of sub dataframes with all of the conditions broken out separately? I'm not sure if this would be easier if 'Season' and 'Treatment' were handled as columns or indices. I'm fine with either way.

I've tried a few things with .iloc and .loc but I can't seem to make it work.


Solution

  • If need replace outliers by missing values use GroupBy.transform with DataFrame.quantile, then compare for lower and greater values by DataFrame.lt and DataFrame.gt, chain masks by | for bitwise OR and set missing values in DataFrame.mask, default replacement, so not specified:

    np.random.seed(2020)
    df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=('red','yellow','green'))
    df.loc[0:49,'Season'] = 'Spring'
    df.loc[50:99,'Season'] = 'Fall'
    df.loc[0:24,'Treatment'] = 'Placebo'
    df.loc[25:49,'Treatment'] = 'Drug'
    df.loc[50:74,'Treatment'] = 'Placebo'
    df.loc[75:99,'Treatment'] = 'Drug'
    df = df[['Season','Treatment','red','yellow','green']]
    
    g = df.groupby(['Season','Treatment'])
    df1 = g.transform('quantile', 0.05)
    df2 = g.transform('quantile', 0.95)
    
    c = df.columns.difference(['Season','Treatment'])
    mask = df[c].lt(df1) | df[c].gt(df2)
    df[c] = df[c].mask(mask)
    
    print (df)
        Season Treatment   red  yellow  green
    0   Spring   Placebo   NaN     NaN   67.0
    1   Spring   Placebo  67.0    91.0    3.0
    2   Spring   Placebo  71.0    56.0   29.0
    3   Spring   Placebo  48.0    32.0   24.0
    4   Spring   Placebo  74.0     9.0   51.0
    ..     ...       ...   ...     ...    ...
    95    Fall      Drug  90.0    35.0   55.0
    96    Fall      Drug  40.0    55.0   90.0
    97    Fall      Drug   NaN    54.0    NaN
    98    Fall      Drug  28.0    50.0   74.0
    99    Fall      Drug   NaN    73.0   11.0
    
    [100 rows x 5 columns]