python, pandas, scipy, noise

Filtering out noise from GPX data


I have a Pandas DataFrame with a speed column that contains occasional noise (the data come from a Garmin and were captured during a run).

I am trying to find a way to average over adjacent points, but when I hit a stretch of values like this

9.112273445
164.5779550738
84.4553498412
4.231089359
4.3740439706

I get caught in an infinite loop.

My algorithm is rather naive:

# Get the indices where speed is greater than or equal to 6:
idx = z[z['speed'] >= 6].index
while len(idx) > 0:
    for i in idx:
        # replace the flagged value with the mean of its neighbours,
        # guarding against indices that fall outside the DataFrame
        if i + 1 >= len(z):
            z.iloc[i, z.columns.get_loc('speed')] = (z['speed'].iloc[i-2] + z['speed'].iloc[i-1]) / 2
        elif i - 1 < 0:
            z.iloc[i, z.columns.get_loc('speed')] = (z['speed'].iloc[i+1] + z['speed'].iloc[i+2]) / 2
        else:
            z.iloc[i, z.columns.get_loc('speed')] = (z['speed'].iloc[i-1] + z['speed'].iloc[i+1]) / 2
    # re-check which values are still above the threshold
    idx = z[z['speed'] >= 6].index

The problem, of course, is that when two adjacent values are both very large, this gets stuck in an infinite loop: each spike keeps being averaged against the other, and they can settle at a level that is still above the threshold.

I am already applying the smoothing filter from the SciPy Cookbook SignalSmooth recipe (using a Hanning window) to remove random noise, but it does not deal with these large spikes in the data.
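
Roughly, the smoothing I am using follows that recipe and looks something like this (the window length of 11 is just an example):

import numpy as np

def smooth_hanning(x, window_len=11):
    # reflect the signal at both ends so the window can be centred on the edge samples
    s = np.r_[x[window_len-1:0:-1], x, x[-2:-window_len-1:-1]]
    # normalised Hanning window so the smoothed values stay on the same scale
    w = np.hanning(window_len)
    # note: as in the Cookbook recipe, the output is slightly longer than the input
    return np.convolve(w / w.sum(), s, mode='valid')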

Short of discarding them, or setting them to a constant value, is there any other simple method of dealing with them?
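
For clarity, the two simple fallbacks I mention would look something like this (using the same threshold of 6 as in my code above):

# discard the flagged rows entirely (this also drops rows where speed is NaN)
z_dropped = z[z['speed'] < 6]

# or cap them at a constant value
z_capped = z.copy()
z_capped.loc[z_capped['speed'] >= 6, 'speed'] = 6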

EDIT

The values I am testing this on are:

0           NaN
1      3.508394
2      5.097879
3      7.743824
4      9.138245
5     13.315918
6     12.836310
7     12.001393
8     15.815223
9      0.000000
10    16.622944
11     9.061864
12     2.089729
13     2.710874
Name: speed, dtype: float64

Solution

  • If you want to "bridge" values greater than six, you can do that like so:

    import numpy as np

    # data is assumed to be a 1-D NumPy array of the speed values
    # flag non-finite entries and values above the threshold as outliers;
    # pad with False at both ends so runs touching the edges are detected too
    outliers = np.r_[False, (~np.isfinite(data)) | (data > 6), False]
    if np.any(outliers):
        # start (lb) and one-past-the-end (rb) index of every run of outliers
        boundaries = np.where(outliers[:-1] != outliers[1:])[0]
        lb = boundaries[::2]
        rb = boundaries[1::2]
        # values immediately to the left (lv) and right (rv) of each run;
        # special case if leftmost and/or rightmost values are outliers
        lv = data[lb-1]
        if lb[0] == 0:
            lv[0] = data[rb[0]]
        rv = data[rb % len(data)]
        if rb[-1] == len(data):
            rv[-1] = data[lb[-1]-1]
        # create fill values by linear interpolation between lv and rv;
        # use a bit of trickery (repeat + cumsum) to keep it vectorised
        lengths = rb-lb
        fv = np.repeat((rv-lv)/(lengths+1), lengths)
        sw = np.cumsum(lengths[:-1])
        fv[sw] += fv[sw-1] - rv[:-1] + lv[1:]
        fv[0] += lv[0]
        fv = np.cumsum(fv)
        # place the interpolated values into a copy of the data
        out = data.copy()
        out[outliers[1:-1]] = fv
    else:
        out = data.copy()
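
    Because each run of flagged values is bridged in one pass, by linear interpolation from the valid samples on either side, two adjacent spikes no longer cause the looping problem from the question. Assuming z is the DataFrame from the question, applying it to the speed column could look roughly like this (the variable names are only illustrative):

    data = z['speed'].to_numpy()   # 1-D array of speeds, NaN included
    # ... run the snippet above on data, which produces out ...
    z['speed'] = out               # write the bridged values back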