Search code examples
pythondatetimeweighted-averagescikitstime-series

How to aggregate timeseries in Python?


I have two different timeseries with partially overlapping timestamps:

import scikits.timeseries as ts
from datetime import datetime 
a = ts.time_series([1,2,3], dates=[datetime(2010,10,20), datetime(2010,10,21), datetime(2010,10,23)], freq='D')
b = ts.time_series([4,5,6], dates=[datetime(2010,10,20), datetime(2010,10,22), datetime(2010,10,23)], freq='D')

which represents following data:

Day:   20. 21. 22. 23.
  a:    1   2   -   3
  b:    4   -   5   6

I would like to calculate a weighted average on every day with coefficients a(0.3) and b(0.7), while ignoring missing values:

Day 20.: (0.3 * 1 + 0.7 * 4) / (0.3 + 0.7) = 3.1 / 1.  = 3.1
Day 21.: (0.3 * 2          ) / (0.3      ) = 0.6 / 0.3 = 2
Day 22.: (          0.7 * 5) / (      0.7) = 3.5 / 0.7 = 5
Day 23.: (0.3 * 3 + 0.7 * 6) / (0.3 + 0.7) = 3.1 / 1.  = 5.1

when I first try to align these timeseries:

a1, b1 = ts.aligned(a, b)

I get correctly masked timeseries:

timeseries([1 2 -- 3],
  dates = [20-Oct-2010 ... 23-Oct-2010],
  freq  = D)

timeseries([4 -- 5 6],
  dates = [20-Oct-2010 ... 23-Oct-2010],
  freq  = D)

but when I do a1 * 0.3 + b1 * 0.7, it ignores values, that are present in one timeseries only:

timeseries([3.1 -- -- 5.1],
   dates = [20-Oct-2010 ... 23-Oct-2010],
   freq  = D)

What should I do to receive the awaited?

timeseries([3.1 2. 5. 5.1],
   dates = [20-Oct-2010 ... 23-Oct-2010],
   freq  = D)

EDIT: The answer should be applicable also to more than two initial timeseries with different weights and differently missing values.

So if we have four timeseries with weights T1(0.1), T2(0.2), T3(0.3) and T4(0.4), their weights at a given timestamp will be:

            |  T1 |  T2 |  T3 |  T4 |
weight      | 0.1 | 0.2 | 0.3 | 0.4 |
-------------------------------------
all present | 10% | 20% | 30% | 40% |
T1 missing  |     | 22% | 33% | 45% |
T1,T2 miss. |     |     | 43% | 57% |
T4 missing  | 17% | 33% | 50% |     |
etc.

Solution

  • I have tried and found this:

    aWgt = 0.3
    bWgt = 0.7
    
    print (np.where(a1.mask, 0., a1.data * aWgt) +
           np.where(b1.mask, 0., b1.data * bWgt)) / (np.where(a1.mask, 0., aWgt) +
                                                     np.where(b1.mask, 0., bWgt))
    
    # array([ 3.1,  2. ,  5. ,  5.1])
    

    This is applicable to the edited question with more than one initial timeseries. But hopefully someone will find better.

    EDIT: And this is my function:

    def weightedAvg(weightedTimeseries):
        sumA = np.sum((np.where(ts.mask, 0., ts.data * weight) for ts, weight in weightedTimeseries), axis=0)
        sumB = np.sum((np.where(ts.mask, 0., weight) for ts, weight in weightedTimeseries), axis=0)
        return np.divide(sumA, sumB)
    
    weightedAvg(((a1, 0.3), (bb, 0.7)))
    # array([ 3.1,  2. ,  5. ,  5.1])
    

    Works for any number of timeseries ;-)