I have the following function:
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# winsorize function
def winsor_try1(var, lower, upper):
    '''
    Outliers Calculation using IQR
    '''
    var = winsorize(var, limits=[lower, upper])
    q1, q3 = np.percentile(var, [25, 75])  # q1,q3 calc
    iqr = q3 - q1  # iqr calc
    lower_bound = round(q1 - (1.5 * iqr), 3)  # lower bound
    upper_bound = round(q3 + (1.5 * iqr), 3)  # upper bound
    outliers = [x for x in var if x < lower_bound or x > upper_bound]
    print('These would be the outliers:', set(outliers), '\n',
          'Total:', len(outliers), '.Upper bound & Lower bound:', lower_bound, '&', upper_bound)

# the variable
df = pd.DataFrame({
    'age': [1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452]})
I would like to write a while loop that applies the function winsor_try1 to the variable df['age'], starting at lower = .01 & upper = .01 and continuing until len(outliers) = 0.
My rationale: as long as len(outliers) > 0, the function should be repeated, so that I can find the limit at which the number of outliers in the age distribution becomes 0.
Desired output would be something like this:
print('At limit =', i, 'there is no more outliers presented in the age variable.')
where i = the limit at which len(outliers) = 0.
Rather than writing a while loop yourself, you can think of this as a scalar root-finding problem and use scipy.optimize.root_scalar.
import numpy as np
from scipy.stats.mstats import winsorize
from scipy.optimize import root_scalar

# winsorize function
def winsor_try1(var, lower, upper):
    '''
    Compute the number of IQR outliers
    '''
    var = winsorize(var, limits=[lower, upper])
    q1, q3 = np.percentile(var, [25, 75])  # q1,q3 calc
    iqr = q3 - q1  # iqr calc
    lower_bound = round(q1 - (1.5 * iqr), 3)  # lower bound
    upper_bound = round(q3 + (1.5 * iqr), 3)  # upper bound
    outliers = [x for x in var if x < lower_bound or x > upper_bound]
    return len(outliers)

# the variable
var = np.asarray([1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452])

def fun(i):
    # try to find `i` at which there is half an outlier
    # it doesn't exist, but this should get closer to the transition
    return winsor_try1(var, i, i) - 0.5

# root_scalar tries to find the argument `i` that makes `fun` return zero
res = root_scalar(fun, bracket=(0, 0.5))

eps = 1e-6
print(winsor_try1(var, res.root + eps, res.root + eps))  # 0
print(winsor_try1(var, res.root - eps, res.root - eps))  # 6
res.root  # 0.15384615384656308
There may be better ways to solve the problem, but I tried to answer the question in a way that's similar to writing a while loop. If you want to know how that while loop could work, there are lots of references on the bisection method or other scalar root-finding algorithms.