Search code examples
pythonpandascsvinfinite-loopstandard-deviation

An infinite while loop in python with pandas calculating the standard deviation


We're trying to delete outliers, but got an infinite loop instead

For a school project we (me and a friend) decided it'd be a good idea to create a tool based on data science. For this we started cleaning a database (I won't import it here because it is too big (xlsx file, csv file)). We're now trying to delete outliers using the 'standard deviation*3 + mean' rule for the column 'duration_minutes'.

Here is the code we used to calculate the standard-deviation and the mean:

def calculateSD(database, column):
    column = database[[column]]
    SD = column.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None)
    return SD

def calculateMean(database, column):
    column = database[[column]]
    mean = column.mean()
    return mean

We thought to do the following:

#Now we have to remove the outliers using the code from the SD.py and SDfunction.py files
minutes = trainsData['duration_minutes'].tolist() #takes the column duration_minutes and puts it in a list
SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
mean = int(calculateMean(trainsData, 'duration_minutes'))
SDhigh = mean+3*SD

The code above calculates the beginning values. Then we started a while-loop to delete outliers. After the outliers are deleted, we re-calculate the standard-deviation, the mean, and the SDhigh again. This is the while-loop:

while np.any(i >= SDhigh for i in minutes): #used to be >=, it doesnt matter for the outcome
    trainsData = trainsData[trainsData['duration_minutes'] < SDhigh] #used to be >=, this caused an infinite loop so I changed it to <=. Then to <
    minutes = trainsData['duration_minutes'].tolist()
    SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
    mean = int(calculateMean(trainsData, 'duration_minutes'))
    SDhigh = mean+3*SD
    print(SDhigh) #to see how the values changed and to confirm it is an infinite loop

the output is along the lines of the following:

611
652
428
354
322
308
300
296
296
296
296

it continues to print 296, and after hours of trying to solve it we came to the conclusion that we're not as smart as we wished.


TL;DR: We're trying to remove all values higher than standard-deviation*3+mean, until none are left (we re-calculate this every time to check if there still are outliers left). However, we get an infinite loop.


Solution

  • You are making things more difficult than they have to be. Calculating the standard deviation to remove outliers and then recalculating it etc is overly complicated (and statistically un-sound). You are better off by using percentile instead of standard deviation

    import numpy as np
    import pandas as pd
    
    # create data
    nums = np.random.normal(50, 8, 200)
    df = pd.DataFrame(nums, columns=['duration'])
    
    # set threshold based on percentiles
    threshold = df['duration'].quantile(.95) * 2
    
    # now only keep rows that are below the threshold
    df = df[df['duration']<threshold]