Search code examples
pythonpandasdataframetime-seriesdate-range

DataFrame time interval extrapolation


I have a DataFrame given with a 10 minute interval which looks likes this:

    intervall_start         group   count
0   2014-03-05 14:30:00     A       2
1   2014-03-05 14:40:00     A       2
2   2014-03-05 14:50:00     B       2
3   2014-03-05 15:00:00     B       1
4   2014-03-05 15:10:00     A       4
5   2014-03-05 15:20:00     B       1

My goal is now to extrapolate this 10 minute interval into a 15 minute, 30 minute or 1 hour interval. I want to set the interval in which the 10 minute interval should be extrapolated. So it should be dynamically!

Let's say I want to extrapolate the 10 minute interval into a 15 minute interval then I want to have this result.

    intervall_start         group   count
0   2014-03-05 14:30:00     A       (10/15 * 2 + 5/15 * 2) = 2
1   2014-03-05 14:45:00     A       (5/15 * 2 + 10/15 * 0) = 0.666
2   2014-03-05 15:00:00     A       (10/15 * 4 + 5/15 * 0) = 2.666
3   2014-03-05 15:15:00     A       (5/15 * 4 + 10/15 * 1) = 2

    intervall_start         group   count
0   2014-03-05 14:45:00     B       (5/15 * 0 + 10/15 * 2) = 1.333
1   2014-03-05 15:00:00     B       (10/15 * 1 + 5/15 * 0) = 0.666
2   2014-03-05 15:15:00     B       (5/15 * 0 + 10/15 * 1) = 0.666

Do you guys have any idea how to do that? But as I said, dynamically! I also want to be able to shift it into a 5 minute or 1 hour interval!

Why I use 10/15 and 5/15 as weights? Because this is the weight for the 15 minute interval. For exmaple if I have 3 counts from 15:00 to 15:10 then this is a 10/15 * 3 weight!

Explanation of count:

I count in my database if the gate for the group was open or closed. 1 is opened and 0 is closed. I just select the gates which are opened (1) and count them. So, I know how many gates were opened for this group!

Now I am not allowed to sum those counts up! Because otherwise I would have more gates opened then I actually have! I have to calculate the average of gates which were opened for the timeframe!

Kind regards and thank you!


Solution

  • I believe what you're looking for is resample:

    # Create a date-time index
    df.intervall_start = pd.to_datetime(df.intervall_start)
    df.set_index("intervall_start", inplace = True)
    
    # make sure we have a sample every 10 minutes (fill missing samples with 0)
    df = df.groupby("group")[["count"]].resample("10min").asfreq()
    df = df.fillna(0)
    
    # sample every minute, to be later used when calculatiung the mean per period
    df = df.reset_index(level = "group").groupby("group")[["count"]].resample("1min").pad()
    
    # resample for every 15 minutes using 'mean'
    df = df.reset_index(level = "group").groupby("group")[["count"]].resample("15min").mean()
    

    The output is:

                                  count
    group intervall_start              
    A     2014-03-05 14:30:00  2.000000
          2014-03-05 14:45:00  0.666667
          2014-03-05 15:00:00  0.363636
    B     2014-03-05 14:45:00  2.000000
          2014-03-05 15:00:00  0.666667
          2014-03-05 15:15:00  0.166667