For a school project we (me and a friend) decided it'd be a good idea to create a tool based on data science. For this we started cleaning a database (I won't import it here because it is too big (xlsx file, csv file)). We're now trying to delete outliers using the 'standard deviation*3 + mean' rule for the column 'duration_minutes'.
Here is the code we used to calculate the standard-deviation and the mean:
def calculateSD(database, column):
column = database[[column]]
SD = column.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None)
return SD
def calculateMean(database, column):
column = database[[column]]
mean = column.mean()
return mean
We thought to do the following:
#Now we have to remove the outliers using the code from the SD.py and SDfunction.py files
minutes = trainsData['duration_minutes'].tolist() #takes the column duration_minutes and puts it in a list
SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
mean = int(calculateMean(trainsData, 'duration_minutes'))
SDhigh = mean+3*SD
The code above calculates the beginning values. Then we started a while-loop to delete outliers. After the outliers are deleted, we re-calculate the standard-deviation, the mean, and the SDhigh again. This is the while-loop:
while np.any(i >= SDhigh for i in minutes): #used to be >=, it doesnt matter for the outcome
trainsData = trainsData[trainsData['duration_minutes'] < SDhigh] #used to be >=, this caused an infinite loop so I changed it to <=. Then to <
minutes = trainsData['duration_minutes'].tolist()
SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
mean = int(calculateMean(trainsData, 'duration_minutes'))
SDhigh = mean+3*SD
print(SDhigh) #to see how the values changed and to confirm it is an infinite loop
the output is along the lines of the following:
611
652
428
354
322
308
300
296
296
296
296
it continues to print 296, and after hours of trying to solve it we came to the conclusion that we're not as smart as we wished.
TL;DR: We're trying to remove all values higher than standard-deviation*3+mean, until none are left (we re-calculate this every time to check if there still are outliers left). However, we get an infinite loop.
You are making things more difficult than they have to be. Calculating the standard deviation to remove outliers and then recalculating it etc is overly complicated (and statistically un-sound). You are better off by using percentile instead of standard deviation
import numpy as np
import pandas as pd
# create data
nums = np.random.normal(50, 8, 200)
df = pd.DataFrame(nums, columns=['duration'])
# set threshold based on percentiles
threshold = df['duration'].quantile(.95) * 2
# now only keep rows that are below the threshold
df = df[df['duration']<threshold]