Search code examples
pythonpandasinfluxdb

Pandas timebased mean


I have imported different data from homeassistant to influx db now I have it in a pandas dataframe and I would like to get the the mean() but it should be based on the time in the index.

I picked a small df to test and it looks like this:

                                   value
time                                   
2021-03-09 07:25:41.989791+00:00    0.0
2021-03-09 07:26:45.165453+00:00    0.0
2021-03-09 16:56:04.806150+00:00    1.0
2021-03-09 18:10:57.762609+00:00    0.0
2021-03-09 19:45:55.182860+00:00    1.0
2021-03-09 19:49:27.519186+00:00    0.0

So this is for example just my one of my lights at home. I would like to have the mean over time. So I can see how many % of the time it is turned on. When it says 1 it should be 1 until the next datapoint. In this df the mean should be very low because most of the time the light is turned off. It is only turned on between 16:56:04 to 18:10:57 and 19:45:55 to 19:49:27. So it is turned on approx: 1hour, 19minutes and the total recorded time 12 hours, 23 minutes. So the light was on for approx 10% of the recorded time.

There are another two problems:

  1. I have sensors that have different values than 1 and 0 (for example temperature)

    2.the day will change.

I don't really know how to get started does anyone have a idea ?


Solution

  • You can calculate durations for each value (time_next - time, then convert to seconds), and then take weighted average for each day:

    # calculate durations
    df['date'] = df['time'].dt.date
    df['time_next'] = df['time'].shift(-1).ffill()
    df['duration_s'] = (df['time_next'] - df['time']).dt.seconds
    
    # calculate weighted average by date
    df.groupby('date').apply(
        lambda z: np.average(z['value'], weights=z['duration_s']))
    

    Output:

    date
    2021-03-09    0.105416
    dtype: float64
    

    P.S. Values can be any number, of course, not just binary


    Update:

    For correctly working with multi-day series we can alter the dataframe adding records for the end of each day (or the beginning of next day) with the last value seen on that day.

    In the sample data in the original post, for example, that would be a 0 value at 2021-03-10 00:00:00. This way duration for the last record of the day will show the amount of seconds remaining until the end of the day, making our weighted average calculate correctly.

    Here is how it may be done in code (here I'm assuming time is the index of the original dataframe):

    # add day-end values
    z = df.append(
        df.resample('1d', label='right').last().ffill()
    ).sort_index()
    
    # calculate durations
    z['duration_s'] = -z.index.to_series().diff(-1).dt.total_seconds()
    
    # calculate weighted average by date
    z.groupby(z.index.date).apply(
        lambda z: np.average(z['value'], weights=z['duration_s'])).dropna()