
How to write a while loop in Python for winsorizing


I have the following function:

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# winsorize function
def winsor_try1(var, lower, upper):
    '''
    Winsorize var, then report the outliers found with the IQR rule
    '''
    var = winsorize(var, limits=[lower, upper])
    q1, q3 = np.percentile(var, [25, 75])                # q1,q3 calc
    iqr = q3 - q1                                        # iqr calc
    lower_bound = round(q1 - (1.5 * iqr),3)              # lower bound
    upper_bound = round(q3 + (1.5 * iqr),3)              # upper bound
    outliers = [x for x in var if x < lower_bound or x > upper_bound]
    print('These would be the outliers:', set(outliers),'\n',
          'Total:', len(outliers),'. Lower bound & upper bound:', lower_bound,'&',upper_bound)

# the variable 
df = pd.DataFrame({
    'age': [1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452]})

I would like to write a while loop that applies winsor_try1 to df['age'], starting at lower = .01 and upper = .01 and continuing until len(outliers) == 0.

My rationale: as long as len(outliers) > 0, the function should be called again, so that I can find the limit at which the age distribution no longer has any outliers.

Desired output would be something like this:

print('At limit =', i, 'there are no more outliers present in the age variable.')

Here i is the limit at which len(outliers) == 0.


Solution

  • Rather than writing a while loop yourself, you can treat this as a scalar root-finding problem and use scipy.optimize.root_scalar.

    import numpy as np
    from scipy.stats.mstats import winsorize
    from scipy.optimize import root_scalar 
    
    # winsorize function
    def winsor_try1(var, lower, upper):
        ''' 
        Compute the number of IQR outliers
        ''' 
        var = winsorize(var,limits=[lower,upper])
        q1, q3= np.percentile(var, [25, 75])                 # q1,q3 calc
        iqr = q3 - q1                                        # iqr calc
        lower_bound = round(q1 - (1.5 * iqr),3)              # lower bound
        upper_bound = round(q3 + (1.5 * iqr),3)              # upper bound
        outliers = [x for x in var if x < lower_bound or x > upper_bound]  
        return len(outliers)
    
    # the variable 
    var = np.asarray([1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452])
    
    def fun(i):
        # try to find `i` at which there is half an outlier
        # it doesn't exist, but this should get closer to the transition
        return winsor_try1(var, i, i) - 0.5
    
    # root_scalar tries to find the argument `i` that makes `fun` return zero
    res = root_scalar(fun, bracket=(0, 0.5))
    
    eps = 1e-6
    print(winsor_try1(var, res.root + eps, res.root + eps))  # 0 outliers just above the root
    print(winsor_try1(var, res.root - eps, res.root - eps))  # 6 outliers just below the root
    print(res.root)  # 0.15384615384656308
    

    There may be better ways to solve the problem, but I tried to answer the question in a way that is similar to writing a while loop. If you want to know how such a while loop could work, there are many references on the bisection method and other scalar root-finding algorithms.
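
For readers who do want an explicit loop, here is a minimal sketch of two variants, using the same age data as above. The helper names (count_iqr_outliers, first_limit_by_stepping, first_limit_by_bisection) and the 0.01 step are made up for this illustration. The bisection version also assumes that the outlier count drops to zero once and stays there as the limit grows, which holds for this data but is not guaranteed in general.

```python
import numpy as np
from scipy.stats.mstats import winsorize


def count_iqr_outliers(values, limit):
    """Winsorize both tails by `limit`, then count points outside the 1.5 * IQR fences."""
    clipped = np.asarray(winsorize(values, limits=[limit, limit]))
    q1, q3 = np.percentile(clipped, [25, 75])
    iqr = q3 - q1
    lower_bound = round(q1 - 1.5 * iqr, 3)
    upper_bound = round(q3 + 1.5 * iqr, 3)
    return int(np.sum((clipped < lower_bound) | (clipped > upper_bound)))


def first_limit_by_stepping(values, start=0.01, step=0.01, max_limit=0.5):
    """Raise the limit in fixed steps until no outliers remain (hypothetical helper)."""
    limit = start
    while limit < max_limit:
        if count_iqr_outliers(values, limit) == 0:
            return limit
        limit = round(limit + step, 10)  # rounding keeps float error from accumulating
    return None  # no limit below max_limit removed every outlier


def first_limit_by_bisection(values, lo=0.0, hi=0.5, tol=1e-6):
    """Halve the interval [lo, hi] until it brackets the transition within `tol`.

    Assumes there are outliers at `lo`, none at `hi`, and a single transition between.
    """
    if count_iqr_outliers(values, lo) == 0:
        return lo
    if count_iqr_outliers(values, hi) > 0:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if count_iqr_outliers(values, mid) > 0:
            lo = mid  # still outliers: the transition is to the right
        else:
            hi = mid  # no outliers: the transition is at or to the left
    return hi


age = np.asarray([1,1,2,5,5,2,5,4,8,2,5,1,41,2,1,4,4,1,1,4,1,2,15,21,5,1,8,22,1,5,2,5,256,5,6,2,2,8,452])

step_limit = first_limit_by_stepping(age)
bisect_limit = first_limit_by_bisection(age)
print('At limit =', step_limit, 'there are no more outliers present in the age variable.')
print('Bisection estimate of the transition:', bisect_limit)
```

With this data the stepping loop should stop at 0.16, the first multiple of 0.01 past the transition, while bisection should land just above 6/39 ≈ 0.1538, in line with the root_scalar result. That value is consistent with winsorize clipping a whole number of values from each tail, here 6 of the 39.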