
Quickly remove outliers from list in Python?


I have many long lists of time and temperature values, each with the following structure:

list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]

Some of the time/temperature pairs are incorrect spikes in the data. For example, at time 8, the temperature spiked to 92 degrees. I would like to get rid of these sudden jumps or dips in the temperature values.

To do this, I wrote the following code (I removed the stuff that isn't necessary and only copied the part that removes the spikes/outliers):

outlierpercent = 3

for i in values:
    temperature = i[1]
    index = values.index(i)
    if index > 0:
        prevtemp = values[index-1][1]
        pctdiff = (temperature/prevtemp - 1) * 100
        if abs(pctdiff) > outlierpercent:
            outliers.append(i)

While this works (where I can set the minimum percentage difference required for it to be considered a spike as outlierpercent), it takes a super long time (5-10 minutes per list). My lists are extremely long (around 5 million data points each), and I have hundreds of lists.

Is there a much quicker way of doing this? My main concern here is time. There are other similar questions, but their answers don't seem efficient enough for very long lists of this structure, so I'm not sure how to do it. Thanks!


Solution

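  • outliers = []  # indexes of the readings flagged as spikes
    outlierpercent = 3
    
    # Iterate over indexes directly instead of calling values.index(i),
    # which rescans the list on every iteration.
    for index in range(1, len(values)):
        temperature = values[index][1]
        prevtemp = values[index-1][1]
    
        pctdiff = (temperature/prevtemp - 1) * 100
        if abs(pctdiff) > outlierpercent:
            outliers.append(index)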
    

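    This should be much faster. In the original loop, values.index(i) searches the list from the start on every iteration, so the total work grows with the square of the list length. Iterating over the indexes directly makes it a single linear pass.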
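
As a rough, self-contained sketch of the loop above, here it is wrapped in a function and run on the sample list from the question. The function name find_spikes and the snake_case variable names are only illustrative, and the sketch assumes no temperature is 0 (otherwise the division fails).

```python
def find_spikes(values, outlier_percent=3.0):
    """Return the indexes of readings that differ from the reading
    immediately before them by more than outlier_percent percent."""
    spikes = []
    for index in range(1, len(values)):
        prev_temp = values[index - 1][1]
        temperature = values[index][1]
        # Assumes prev_temp is never 0.
        pct_diff = (temperature / prev_temp - 1) * 100
        if abs(pct_diff) > outlier_percent:
            spikes.append(index)
    return spikes


list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74],
         [6, 73], [7, 71], [8, 92], [9, 73]]
spikes = find_spikes(list1)
print(spikes)
```

On this sample the result contains both index 7 (the 92-degree spike) and index 8, because the valid reading after the spike is compared against the spike itself.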

    Update:

    The reason only the first outlier is removed is that, after an outlier is found, the next iteration still compares against the temperature of that outlier (prevtemp = values[index-1][1]).

    I believe you can avoid that by only updating the previous temperature when the current reading is valid. Something like this:

    outliers = []  # indexes of the readings flagged as spikes
    outlierpercent = 3
    prevtemp = values[0][1]  # last temperature accepted as valid
    
    for index in range(1, len(values)):
        temperature = values[index][1]
    
        pctdiff = (temperature/prevtemp - 1) * 100
        # outlier - add to list and don't update prev temp
        if abs(pctdiff) > outlierpercent:
            outliers.append(index)
        # valid temp, update prev temp to the current reading
        else:
            prevtemp = temperature
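
A minimal sketch of the same idea that also builds the cleaned list in one pass, so nothing has to be deleted from the original afterwards. The name remove_spikes is hypothetical, and the sketch assumes the first reading is valid and that no accepted temperature is 0.

```python
def remove_spikes(values, outlier_percent=3.0):
    """Return (kept, outliers): the readings without spikes, and the
    indexes of the readings that were dropped. Each reading is compared
    against the last accepted temperature, not against a dropped spike."""
    if not values:
        return [], []
    kept = [values[0]]          # assumption: the first reading is valid
    outliers = []
    prev_temp = values[0][1]    # last temperature accepted as valid
    for index in range(1, len(values)):
        temperature = values[index][1]
        pct_diff = (temperature / prev_temp - 1) * 100
        if abs(pct_diff) > outlier_percent:
            outliers.append(index)
        else:
            kept.append(values[index])
            prev_temp = temperature
    return kept, outliers


list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74],
         [6, 73], [7, 71], [8, 92], [9, 73]]
kept, outliers = remove_spikes(list1)
print(outliers)
print(kept)
```

On the sample list only index 7 (the reading [8, 92]) is dropped, and the following reading [9, 73] is kept because it is compared against 71 rather than 92. One trade-off of this approach is that a genuine, lasting change of more than 3% would cause every later reading to be flagged, since the baseline never moves.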