Tags: python, datetime, pandas, grouping, python-datetime

How to group rows within a time period using Python


I have a DataFrame of transactions. I want to group these transactions by their item and time column values: the goal is to group rows for the same item that are within 1 hour of each other. A new group starts at the time of the next observation that was not within an hour of the prior observation (see the start time column in DataFrame B).

Here is the data: I want to convert A to B.

A=
item  time                 result
A     2016-04-18 13:08:25  Y
A     2016-04-18 13:57:05  N
A     2016-04-18 14:00:12  N
A     2016-04-18 23:45:50  Y
A     2016-04-20 16:53:48  Y
A     2016-04-20 17:11:47  N
B     2016-04-18 15:24:48  N
C     2016-04-23 13:20:44  N
C     2016-04-23 14:02:23  Y
C   2016-04-23 14:02:23  Y


B=
item  start time           end time             Ys  Ns  total count
A     2016-04-18 13:08:25  2016-04-18 14:08:25  1   2   3
A     2016-04-18 23:45:50  2016-04-19 00:45:50  1   0   1
A     2016-04-20 16:53:48  2016-04-20 17:53:48  1   1   2
B     2016-04-18 15:24:48  2016-04-18 16:24:48  0   1   1
C     2016-04-23 13:20:44  2016-04-23 14:20:44  1   1   2

Here is what I did:

grouped = A.groupby('item')
A['end'] = (grouped['time'].transform(lambda grp: grp.min()+pd.Timedelta(hours=1)))
A2 = A.loc[(A['time'] <= A['end'])]

This gives me only one group per item: the transactions within 1 hour of that item's first transaction. So I'm missing the later transactions for the same item that are more than 1 hour after the first. My struggle is how to get those groups. Once I have them, I can use pd.crosstab to get the counts I want from the result column.

Another idea I have is to sort A by item and time, and then go row by row. If the time is within 1 hour of the previous row, it adds to that group, otherwise, it creates a new group.
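
The row-by-row idea above can also be written without an explicit loop. The following is only a sketch under assumptions: it rebuilds A from the sample rows shown earlier, and group_id is a hypothetical column name. Note that it compares each row with the previous row, so gaps can chain: a group may span more than 1 hour in total. That differs from the fixed 1-hour windows in B, although both give the same groups on this sample.

```python
import pandas as pd

# Sample rows copied from DataFrame A in the question.
A = pd.DataFrame({
    "item": ["A", "A", "A", "A", "A", "A", "B", "C", "C"],
    "time": pd.to_datetime([
        "2016-04-18 13:08:25", "2016-04-18 13:57:05", "2016-04-18 14:00:12",
        "2016-04-18 23:45:50", "2016-04-20 16:53:48", "2016-04-20 17:11:47",
        "2016-04-18 15:24:48", "2016-04-23 13:20:44", "2016-04-23 14:02:23",
    ]),
    "result": ["Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"],
})

# Sort so that each row's predecessor is the previous transaction of the same item.
A = A.sort_values(["item", "time"]).reset_index(drop=True)

# Gap to the previous row of the same item (NaT for each item's first row).
gap = A.groupby("item")["time"].diff()

# A row opens a new group if it is the first row of its item
# or if it comes more than 1 hour after the previous row.
new_group = gap.isna() | (gap > pd.Timedelta(hours=1))

# A running count of group starts gives one id per group.
A["group_id"] = new_group.cumsum()

print(A)
```

Grouping by group_id (or by item and group_id) then gives one group per cluster of nearby transactions.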


Solution

    1) Set up a window_end column for later use with .groupby(), initialized to each row's time plus one hour. Then define get_windows() to check, for each item group, whether a row fits the current 1-hour window: if it does, the row takes that window's end; otherwise it keeps its initialized value and opens a new window. Apply to all item groups:

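    import pandas as pd

    # Assumes df['time'] is datetime64, df is sorted by item and time,
    # and df has a unique index. Every row starts with its own 1-hour window.
    df['window_end'] = df['time'] + pd.Timedelta(hours=1)

    def get_windows(data):
        # data holds the rows of one item; df is updated in place
        window_end = data.iloc[0]['window_end']
        for index, row in data.iloc[1:].iterrows():
            if window_end > row['time']:
                # row falls inside the current window: reuse its end
                df.loc[index, 'window_end'] = window_end
            else:
                # row opens a new window: keep its initialized end
                window_end = row['window_end']

    # get_windows() returns nothing, so loop over the groups rather than .apply()
    for _, group in df.groupby('item'):
        get_windows(group)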
    

    2) Use windows and item with .groupby() and return .value_counts() as transposed DataFrame, clean up index, and add total:

    df = df.groupby(['window_end', 'item']).result.apply(lambda x: x.value_counts().to_frame().T)
    df = df.fillna(0).astype(int).reset_index(level=2, drop=True)
    df['total'] = df.sum(axis=1)
    

    to get:

                              N  Y  total
    window_end          item             
    2016-04-18 14:08:25 A     2  1      3
    2016-04-18 16:24:48 B     1  0      1
    2016-04-19 00:45:50 A     0  1      1
    2016-04-20 17:53:48 A     1  1      2
    2016-04-23 14:20:44 C     1  1      2
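
    The result above is indexed by window_end only, while the target table B in the question also has a start time column and Ys / Ns / total count headers. One possible way to get that exact layout is sketched below. It is not the code from this answer: window_starts() is a hypothetical helper, the data is rebuilt from the sample rows in the question, and pd.crosstab is used because the question mentions it.

```python
import pandas as pd

# Sample rows copied from DataFrame A in the question.
df = pd.DataFrame({
    "item": ["A", "A", "A", "A", "A", "A", "B", "C", "C"],
    "time": pd.to_datetime([
        "2016-04-18 13:08:25", "2016-04-18 13:57:05", "2016-04-18 14:00:12",
        "2016-04-18 23:45:50", "2016-04-20 16:53:48", "2016-04-20 17:11:47",
        "2016-04-18 15:24:48", "2016-04-23 13:20:44", "2016-04-23 14:02:23",
    ]),
    "result": ["Y", "N", "N", "Y", "Y", "N", "N", "N", "Y"],
})
df = df.sort_values(["item", "time"]).reset_index(drop=True)

ONE_HOUR = pd.Timedelta(hours=1)


def window_starts(times, width=ONE_HOUR):
    """For a time-sorted Series, return the start of each row's fixed window."""
    starts = []
    current = None
    for t in times:
        # Open a new window for the first row, or when t is at or past the window end.
        if current is None or t >= current + width:
            current = t
        starts.append(current)
    return pd.Series(starts, index=times.index)


# Assign windows separately for each item.
df["start time"] = df.groupby("item")["time"].transform(window_starts)
df["end time"] = df["start time"] + ONE_HOUR

# Count Y and N results per (item, window).
counts = pd.crosstab([df["item"], df["start time"], df["end time"]], df["result"])
counts = counts.reindex(columns=["Y", "N"], fill_value=0)
counts = counts.rename(columns={"Y": "Ys", "N": "Ns"})
counts["total count"] = counts["Ys"] + counts["Ns"]

B = counts.reset_index()
B.columns.name = None
print(B)
```

    The window rule here matches get_windows() above: a row belongs to the current window only if its time is strictly before the window end.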