pythonpandasgroup-byquantile# How to flag tukey outliers using python pandas groupby

I would like to use pandas groupby to flag values in a df that are outliers. I think I've got it working, but as I'm new to python, wanted to ask if there is a more obvious / pythonic approach.

Given input data with two groups, two variables X and Y:

```
n=10000
df= pd.DataFrame({'key': ['a']*n+['b']*n
,"x" : np.hstack((
np.random.normal(10, 1.0, size=n)
,np.random.normal(100, 1.0, size=n)
))
,"y" : np.hstack((
np.random.normal(20, 1.0, size=n)
,np.random.normal(200, 1.0, size=n)
))
})
```

To identify outliers I need to calculate the quartiles and inter-quartile range for each group to calculate the limits. Seemed reasonable to create a function:

```
def get_outlier(x,tukeymultiplier=2):
Q1=x.quantile(.25)
Q3=x.quantile(.75)
IQR=Q3-Q1
lowerlimit = Q1 - tukeymultiplier*IQR
upperlimit = Q3 + tukeymultiplier*IQR
return (x<lowerlimit) | (x>upperlimit)
```

And then use groupby and call the function via transform, e.g.:

```
g=df.groupby('key')[['x','y']]
df['x_outlierflag']=g.transform(get_outlier).x
df['y_outlierflag']=g.transform(get_outlier).y
df.loc[df.x_outlierflag==True]
df.loc[df.y_outlierflag==True]
```

I'm not worried about performance at this point, because the data are small. But not sure if there is a more natural way to do this? For example, it's not clear to me how apply() differs from transform(). Is there an apply() approach that would be better?

Solution

Here is the alternative more efficient and concise approach with `groupby`

and `transform`

```
c = ['x', 'y']
Q1 = df.groupby('key')[c].transform('quantile', 0.25)
Q3 = df.groupby('key')[c].transform('quantile', 0.75)
IQR = Q3 - Q1
lower = df[c].lt(Q1 - IQR * tukeymultiplier)
upper = df[c].gt(Q3 + IQR * tukeymultiplier)
df.join((lower | upper).add_suffix('_outlier_flag'))
```

```
key x y x_outlier_flag y_outlier_flag
0 a 37.179263 124.097101 False True
1 a -20.823970 -22.472672 False False
2 a 123.438609 -121.588755 True True
3 a -53.817197 31.569917 False False
4 a -83.406668 48.850183 True False
5 a 125.069615 -119.059600 True True
6 a -177.185957 -39.235412 True False
7 a -57.935018 -36.746727 False False
8 a 95.881448 -27.108608 True False
9 a -15.649324 86.914889 False True
....
```

- Python Jinja2 LaTeX Table
- Getting attributes of a class
- How can I print many significant figures in Python?
- How to allow list append() method to return the new list
- Calculate Last Friday of Month in Pandas
- Python type hint for Iterable[str] that isn't str
- How to iterate over a list in chunks
- How to exit the entire application from a Python thread?
- Running shell command and capturing the output
- How do I pass a variable by reference?
- Convert range(r) to list of strings of length 2 in python
- How can I get the start and end dates for each week?
- how to use send_message() in python-telegram-bot
- Python conditional replacement based on element type
- How can I count the number of items in an arbitrary iterable (such as a generator)?
- Find longest consecutive range of numbers in list
- Insert text in braces with asyncpg
- How does one put a link / url to the web-site's home page in Django?
- How to determine if a path is a subdirectory of another?
- Custom Keybindings for Ipython terminal
- FastAPI asynchronous background tasks blocks other requests?
- How to make sure that information from one file is duplicated into several text documents, without specific lines
- Installing a Python environment with Anaconda
- sklearn pipeline model predicting same results for all input
- Brew command not found after installing Anaconda Python
- How to get an XPath from selenium webelement or from lxml?
- Pipe PuTTY console to Python script
- How to align the axes of a figure in matplotlib?
- Persist ParentDocumentRetriever of langchain
- How to reset index in a pandas dataframe?