pythonpandasgroup-byquantile

How to flag tukey outliers using python pandas groupby

I would like to use pandas groupby to flag values in a df that are outliers. I think I've got it working, but as I'm new to python, wanted to ask if there is a more obvious / pythonic approach.

Given input data with two groups, two variables X and Y:

``````n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
,"x"  : np.hstack((
np.random.normal(10, 1.0, size=n)
,np.random.normal(100, 1.0, size=n)
))
,"y"  : np.hstack((
np.random.normal(20, 1.0, size=n)
,np.random.normal(200, 1.0, size=n)
))
})
``````

To identify outliers I need to calculate the quartiles and inter-quartile range for each group to calculate the limits. Seemed reasonable to create a function:

``````def get_outlier(x,tukeymultiplier=2):
Q1=x.quantile(.25)
Q3=x.quantile(.75)
IQR=Q3-Q1
lowerlimit = Q1 - tukeymultiplier*IQR
upperlimit = Q3 + tukeymultiplier*IQR
return (x<lowerlimit) | (x>upperlimit)
``````

And then use groupby and call the function via transform, e.g.:

``````g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.transform(get_outlier).x
df['y_outlierflag']=g.transform(get_outlier).y

df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]
``````

I'm not worried about performance at this point, because the data are small. But not sure if there is a more natural way to do this? For example, it's not clear to me how apply() differs from transform(). Is there an apply() approach that would be better?

Solution

• Here is the alternative more efficient and concise approach with `groupby` and `transform`

``````c = ['x', 'y']
Q1 = df.groupby('key')[c].transform('quantile', 0.25)
Q3 = df.groupby('key')[c].transform('quantile', 0.75)
IQR = Q3 - Q1

lower = df[c].lt(Q1 - IQR * tukeymultiplier)
upper = df[c].gt(Q3 + IQR * tukeymultiplier)

``````

``````  key           x           y  x_outlier_flag  y_outlier_flag
0   a   37.179263  124.097101           False            True
1   a  -20.823970  -22.472672           False           False
2   a  123.438609 -121.588755            True            True
3   a  -53.817197   31.569917           False           False
4   a  -83.406668   48.850183            True           False
5   a  125.069615 -119.059600            True            True
6   a -177.185957  -39.235412            True           False
7   a  -57.935018  -36.746727           False           False
8   a   95.881448  -27.108608            True           False
9   a  -15.649324   86.914889           False            True
....
``````