Search code examples
pythonpandasdataframebinning

Binning state time of a device by day using Pandas DataFrames


I have a Pandas dataframe of power enable/disable commands vs time. The index is not in-use at the moment. You can create it on your own with this:

pd.DataFrame(
    {'command_timestamp': {
        0: pd.Timestamp('2023-08-01 15:39:42'),
        1: pd.Timestamp('2023-08-02 03:30:39'),
        2: pd.Timestamp('2023-08-02 16:09:35'),
        4: pd.Timestamp('2023-08-02 17:30:16'),
        5: pd.Timestamp('2023-08-02 17:32:05'),
        6: pd.Timestamp('2023-08-02 17:45:43'),
        7: pd.Timestamp('2023-08-03 17:48:01'),
        8: pd.Timestamp('2023-08-03 18:20:11'),
        9: pd.Timestamp('2023-08-04 18:49:37'),
        10: pd.Timestamp('2023-08-07 21:13:05')},
     'command': {
        0: 'enable',
        1: 'disable',
        2: 'enable',
        4: 'enable',
        5: 'enable',
        6: 'disable',
        7: 'enable',
        8: 'disable',
        9: 'enable',
        10: 'disable'}})

My dataframe.

What I am trying to do

I need to calculate the "ON time" of the device by day. Imagine that the actual data set is much larger than this example set. I cannot seem to come up with good solution, much less one that doesn't involve iterating through the dataframe and checking a load of if statements.

Some assumptions and things to consider:

  • Consecutive enable or disable commands can be ignored (i.e. if the device is already ON, another 'enable' won't do anything, and there are no missing commands).
  • While, given the previous point, we can't know for sure, we will assume that if the first command of the dataset is "enable", the device was OFF all day prior to that.
    • Similarly, if the first command was "disable", we will assume it was ON all day up to that point.
  • If the last command of the dataset was "enable", we will assume the device was ON for the remainder of the day.
  • If it is turned on one day and not turned off until the next day (or multiple days later), the time ON for day one should go until midnight, and the time ON for the following days should start at midnight. Put another way, a single day shouldn't have more than 24hr of ON time.

For a little context, this is for representing "daily utilization" of the device rather than counting "consecutive on-time" etc..

Sample data expected results

Here is a manual calculation of the example dataset:

results = {
    '2023-08-01': (
        pd.Timestamp('2023-08-02 00:00:00') - 
        pd.Timestamp('2023-08-01 15:39:42')),
    
    '2023-08-02': (
        pd.Timestamp('2023-08-02 03:30:39') - 
        pd.Timestamp('2023-08-02 00:00:00')
        ) + (
        pd.Timestamp('2023-08-02 17:45:43') - 
        pd.Timestamp('2023-08-02 16:09:35')
        ), 
    '2023-08-03': (
        pd.Timestamp('2023-08-03 18:20:11') - 
        pd.Timestamp('2023-08-03 17:48:01')
        ), 
    '2023-08-04': (
        pd.Timestamp('2023-08-05 00:00:00') - 
        pd.Timestamp('2023-08-04 18:49:37')
        ), 
    '2023-08-05': (
        pd.Timestamp('2023-08-06 00:00:00') - 
        pd.Timestamp('2023-08-05 00:00:00')
        ), 
    '2023-08-06': (
        pd.Timestamp('2023-08-07 00:00:00') - 
        pd.Timestamp('2023-08-06 00:00:00')
        ), 
    '2023-08-07': (
        pd.Timestamp('2023-08-07 21:13:05') - 
        pd.Timestamp('2023-08-07 00:00:00')
        )
}

Sample dataset results.


Solution

  • Okay, I now think I understand your question. First, create fictitious entries to mark your day boundaries.

    df = pd.concat([
        df.set_index('command_timestamp'),
        df.reindex(pd.date_range(
            start=df['command_timestamp'].min().date(),
            end=df['command_timestamp'].max().date(),
            freq='D'))
    ]).sort_index()
    

    (I assume start is disable.) Then assume that the last state carries forward to the fictitious day boundary. Then do the differencing. You must pass to_series to access the right function.

    df['command'] = df['command'].ffill().fillna('disable')
    df['diff'] = df.index.to_series().diff().shift(-1)
    

    A group by the date along with a filtered summation then yields your desired outcome:

    >>> df.groupby(df.index.to_series().dt.date) \
    ...     .apply(lambda d: d.loc[d['command'] == 'enable', 
    ...                            'diff'].sum())
    2023-08-01   0 days 08:20:18
    2023-08-02   0 days 05:06:47
    2023-08-03   0 days 00:32:10
    2023-08-04   0 days 05:10:23
    2023-08-05   1 days 00:00:00
    2023-08-06   1 days 00:00:00
    2023-08-07   0 days 21:13:05
    dtype: timedelta64[ns]
    

    Explanatory note. Prior to the group by summation, the relevant columns of the data frame look like this:

                         command            diff
    2023-08-01 00:00:00  disable 0 days 15:39:42
    2023-08-01 15:39:42   enable 0 days 08:20:18
    2023-08-02 00:00:00   enable 0 days 03:30:39
    2023-08-02 03:30:39  disable 0 days 12:38:56
    2023-08-02 16:09:35   enable 0 days 01:20:41
    2023-08-02 17:30:16   enable 0 days 00:01:49
    2023-08-02 17:32:05   enable 0 days 00:13:38
    2023-08-02 17:45:43  disable 0 days 06:14:17
    2023-08-03 00:00:00  disable 0 days 17:48:01
    2023-08-03 17:48:01   enable 0 days 00:32:10
    2023-08-03 18:20:11  disable 0 days 05:39:49
    2023-08-04 00:00:00  disable 0 days 18:49:37
    2023-08-04 18:49:37   enable 0 days 05:10:23
    2023-08-05 00:00:00   enable 1 days 00:00:00
    2023-08-06 00:00:00   enable 1 days 00:00:00
    2023-08-07 00:00:00   enable 0 days 21:13:05
    2023-08-07 21:13:05  disable             NaT
    

    It may be more didactic to read diff instead as time_elapsed.