Tags: python, datetime, data-mining, smoothing

smoothing irregularly sampled time data


Given a table where the first column is seconds past a certain reference point and the second one is an arbitrary measurement:

6   0.738158581
21  0.801697222
39  1.797224596
49  2.77920469
54  2.839757536
79  3.832232283
91  4.676794376
97  5.18244704
100 5.521878863
118 6.316630137
131 6.778507504
147 7.020395216
157 7.331607129
176 7.637492223
202 7.848079136
223 7.989456499
251 8.76853608
278 9.092367123 
    ...

As you can see, the measurements are sampled at irregular time points. I need to smooth the data by averaging the readings from the 100 seconds prior to each measurement (in Python). Since the data table is huge, an iterator-based method is strongly preferred. Unfortunately, after two hours of coding I can't figure out an efficient and elegant solution.

Can anyone help me?

EDITS

  1. I want one smoothed reading for each raw reading, and the smoothed reading is to be the arithmetic mean of the raw reading and any others in the previous 100 (delta) seconds; see the sketch after this list. (John, you are right)

  2. Huge means roughly 1e6 to 10e6 lines, and it needs to work within tight RAM

  3. The data is approximately a random walk

  4. The data is sorted
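
For illustration, the requirement spelled out above and in EDITS 1-4 amounts to a rolling time-window mean over sorted (t, v) pairs. Below is a minimal editorial sketch using collections.deque; it is not one of the answers benchmarked in the RESOLUTION, and the names window_means and delta are made up for the example:

    from collections import deque

    def window_means(data, delta=100.0):
        # data: iterable of (seconds, value) pairs sorted by time.
        # Yields (t, mean of all values in the window (t - delta, t]).
        window = deque()
        running_sum = 0.0
        for t, v in data:
            window.append((t, v))
            running_sum += v
            # drop readings more than delta seconds older than the current one
            while window[0][0] <= t - delta:
                _, old_v = window.popleft()
                running_sum -= old_v
            yield t, running_sum / len(window)

Note that this simple version keeps adding to and subtracting from running_sum forever, so floating-point error can accumulate; the accepted solution below periodically recomputes the sum to avoid exactly that.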

RESOLUTION

I have tested the solutions proposed by J Machin and yairchu. They both give the same results; however, on my data set, J Machin's version scales roughly quadratically, while yairchu's is linear. Execution times as measured by IPython's %timeit (in microseconds):

data size    J Machin    yairchu
       10        90.2       55.6
       50         930        258
      100        3080        514
      500       64700       2660
     1000      253000       5390
     2000      952000      11500
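
For reference, a rough sketch of how such a comparison can be reproduced; the data generator below is an assumption for illustration (the original benchmark inputs are not shown), and getAvgValues is the generator from the accepted solution below:

    import random

    def make_data(n, max_gap=30):
        # build n (seconds, value) rows with irregular spacing, sorted by time;
        # the values follow an approximate random walk, as described in EDIT 3
        rows, t, v = [], 0, 0.0
        for _ in range(n):
            t += random.randint(1, max_gap)
            v += random.uniform(-1, 1)
            rows.append((t, v))
        return rows

    data = make_data(1000)
    # in an IPython session:
    # %timeit list(getAvgValues(data, 100))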

Thank you all for the help.


Solution

  • I'm keeping a running sum to which I add the new members and subtract the old ones. However, this way one may suffer from accumulating floating-point inaccuracies.

    Therefore I implement a "deque" with a list, and whenever my deque reallocates to a smaller size, I recalculate the sum on the same occasion.

    I'm also calculating the average up to point x including point x itself, so there's always at least one sample point to average.

    def getAvgValues(data, avgSampleTime):
      # data: iterable of (t, v) pairs sorted by time t.
      # Yields (t, mean of all values in the window (t - avgSampleTime, t]).
      prevValsBuf = []
      prevValsStart = 0
      tot = 0
      for t, v in data:
        avgStart = t - avgSampleTime
        # remove too old values
        while prevValsStart < len(prevValsBuf):
          pt, pv = prevValsBuf[prevValsStart]
          if pt > avgStart:
            break
          tot -= pv
          prevValsStart += 1
        # add new item
        tot += v
        prevValsBuf.append((t, v))
        # yield result
        numItems = len(prevValsBuf) - prevValsStart
        yield (t, tot / numItems)
        # clean prevVals if it's time
        if prevValsStart * 2 > len(prevValsBuf):
          prevValsBuf = prevValsBuf[prevValsStart:]
          prevValsStart = 0
          # recalculate tot so floating-point error doesn't accumulate
          tot = sum(v for (t, v) in prevValsBuf)
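
    A minimal usage sketch (editorial, not part of the original answer), feeding the first few rows from the question through a 100-second window:

    readings = [(6, 0.738158581), (21, 0.801697222), (39, 1.797224596),
                (49, 2.77920469), (54, 2.839757536), (79, 3.832232283),
                (91, 4.676794376), (97, 5.18244704), (100, 5.521878863),
                (118, 6.316630137)]
    for t, avg in getAvgValues(readings, 100):
      print(t, avg)

    Each yielded pair is (t, mean of the reading at t and every reading from the preceding 100 seconds); for the last row here, t = 118, the first reading at t = 6 has already dropped out of the window.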