Search code examples
pythonpandasweighted-average

Create a weighted mean for a irregular timeseries in pandas


from simulation data with variable timestep I have a irregular time-vector as index for my values, they are stored in a pandas.DataFrame.

Let's consider a simplified test case:

import pandas as pd
import datetime
time_vec = [datetime.time(0,0),datetime.time(0,0),datetime.time(0,5),datetime.time(0,7),datetime.time(0,10)]
df = pd.DataFrame([1,2,4,3,6],index = time_vec)

Using the normal df.mean() -function would result in the answer 3.2, which would only be true if the time-vector would be equidistant.

I think the correct result would be 3.55 as for the first timestep (zero seconds long), the average value is 1.5, for the second timestep the average Value is 3 (five minutes long) etc, this results in:

1.5 * 0 + 3*5 + 3.5 * 2 + 4.5 * 3 = 35.5

which results in an average of 3.55 (35.5 / (0 + 5 + 2 + 3)).

Is there an efficient way to do this with pandas?

This should finally result in something like

df.resample('15M',how = 'This very Method I am looking for') 

to create average values with an equidistant time-vector.


Solution

  • Well, I figured out how to solve my problem. I don't know, if it is a nice solution, but it works.

    I changed the original code in the question by exchanging datetime.time by datetime.datetime, otherwise it won't work (there is no method total_seconds() for datetime.time-Objects). I also had to import numpy to be able to use numpy.average.

    So now the code would be:

    import datetime
    import numpy as np
    import pandas as pd
    time_vec =     [datetime.datetime(2007,1,1,0,0)
                   ,datetime.datetime(2007,1,1,0,0) 
                   ,datetime.datetime(2007,1,1,0,5)     
                   ,datetime.datetime(2007,1,1,0,7)
                   ,datetime.datetime(2007,1,1,0,10)]
    df = pd.DataFrame([1,2,4,3,6],index = time_vec)
    

    This little function solved my problem:

    def time_based_weighted_mean(tv_df):
        time_delta = [(x-y).total_seconds() for x,y in zip(df.index[1:],df.index[:-1])]
        weights = [x+y for x,y in zip([0]+ time_delta,time_delta+[0])]
        res = np.average(df[0],weights = weights) 
        return res
    print time_based_weighted_mean(df[0])
    

    I first tried to use pd.index.diff() to compute the time_delta-Array, but this resulted in a numpy.datetime64 Series, where I did not know how to convert them into floats, as np.average requires floats as input-type for weights.

    I'm thankful for any suggestions to improve the code.