Search code examples
pythonpandasdataframeaggregatecriteria

Aggregating Pandas DataFrame rows based on multiple criteria (userID, end date = start date, etc.)


I've read through a lot of answers on how to aggregate rows in a pandas dataframe but I've had a hard time figuring out how to apply it to my case. I have a dataframe containing trips data for vehicles. So each vehicle within a given day can do several trips. Here's an example below:

vehicleID start pos time end pos time duration (seconds) meters travelled
XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00 1434 2000
XXXXX 2021-10-19 13:49:09+00:00 2021-10-19 13:59:29+00:00 620 5000
XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:26:40+00:00 364 70000
YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00 932 8000
YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00 728 90000

It often happens that the start time of a trip, on a given day, is only a few minutes after the end time of the previous trip, which means that these can be chained into a single trip.

I would like to aggregate the rows so that if the new start pos time overlaps with the previous pos time, or a gap of less than 30 minutes happens between the two, these become a single row, summing the duration of the trip in seconds and meters travelled, obviously by vehicleID. The new df should also contain those trips that didn't require the aggregation (edited for clarity). So this is the output I'm trying to get:

vehicleID start pos time end pos time duration (seconds) meters travelled
XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00 1434 2000
XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00 984 75000
YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00 932 8000
YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00 728 90000

I feel like a groupby and an agg would be involved by I have no clue how to go about this. Any help would be appreciated! Thanks!


Solution

  • There is probably a more efficient way to code this up, but something like this should work (the new_df has what you're looking for):

    NOTE: code below assumes start and end times are in datetime format

    
    df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
                          4: 'YYYYY'},
            'start pos time': {0: '2021-10-26 06:01:12+00:00',
                               1: '2021-10-19 13:49:09+00:00',
                               2: '2021-10-19 13:20:36+00:00',
                               3: '2022-09-10 15:14:07+00:00',
                               4: '2022-08-28 15:16:35+00:00'},
            'end pos time': {0: '2021-10-26 06:25:06+00:00',
                             1: '2021-10-19 13:59:29+00:00',
                             2: '2021-10-19 13:26:40+00:00',
                             3: '2022-09-10 15:29:39+00:00',
                             4: '2022-08-28 15:28:43+00:00'},
            'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
            'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
            })
    
    # sort dataframe by ID and then start time of trip
    df = df.sort_values(by=['vehicleID', 'start pos time'])
    
    # create a new column with the end time of the previous ride
    df.loc[:, 'prev end'] = df['end pos time'].shift(1)
    
    # create a new column with the difference between the start time of the current trip and the end time of the prior one
    df.loc[:, 'diff'] = df.loc[:, 'start pos time'] - df.loc[:, 'prev end']
    
    
    # helper function to convert difference between datetime objects to seconds
    def get_total_seconds(datetime_delta):
        return datetime_delta.total_seconds()
    
    
    # convert difference column to seconds
    df.loc[:, 'diff'] = df['diff'].apply(get_total_seconds)
    
    # where vehicle IDs are the same and the difference between the start time of the current trip and end time of the
    # prior trip is less than or equal to 30 minutes, change the start time of the current trip to the start time of the 
    # prior one
    df.loc[((df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)), 'start pos time'] = df['start pos time'].shift(1)
    
    # create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
    new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time':'max',
                                                                              'duration (seconds)':'sum',
                                                                              'meters travelled':'sum'})
    

    EDIT: if there may be >2 trips that need to be aggregated (as @ouroboros1 pointed out) you can replace everything after the "convert difference column to seconds" code with:

    # [based on @ouroboros1 solution] where vehicle IDs are the same and the difference between the start time of the current
    # trip and end time of the prior trip is less than or equal to 30 minutes, put trips in the same "group"
    df.loc[:, 'group'] = ((df['vehicleID'] != df['vehicleID'].shift(1)) | (df['diff'] > 30*60)).cumsum()
    
    # create a new dataframe, grouped by vehicle ID and group, using the minimum start time and maximum end time for each group
    new_df = df.groupby(['vehicleID', 'group'], as_index=False).agg({'start pos time':'min',
                                                                     'end pos time':'max',
                                                                     'duration (seconds)':'sum',
                                                                     'meters travelled':'sum'})