I've read through a lot of answers on how to aggregate rows in a pandas dataframe but I've had a hard time figuring out how to apply it to my case. I have a dataframe containing trips data for vehicles. So each vehicle within a given day can do several trips. Here's an example below:
vehicleID | start pos time | end pos time | duration (seconds) | meters travelled |
---|---|---|---|---|
XXXXX | 2021-10-26 06:01:12+00:00 | 2021-10-26 06:25:06+00:00 | 1434 | 2000 |
XXXXX | 2021-10-19 13:49:09+00:00 | 2021-10-19 13:59:29+00:00 | 620 | 5000 |
XXXXX | 2021-10-19 13:20:36+00:00 | 2021-10-19 13:26:40+00:00 | 364 | 70000 |
YYYYY | 2022-09-10 15:14:07+00:00 | 2022-09-10 15:29:39+00:00 | 932 | 8000 |
YYYYY | 2022-08-28 15:16:35+00:00 | 2022-08-28 15:28:43+00:00 | 728 | 90000 |
It often happens that the start time of a trip, on a given day, is only a few minutes after the end time of the previous trip, which means that these can be chained into a single trip.
I would like to aggregate the rows so that if the new start pos time overlaps with the previous pos time, or a gap of less than 30 minutes happens between the two, these become a single row, summing the duration of the trip in seconds and meters travelled, obviously by vehicleID. The new df should also contain those trips that didn't require the aggregation (edited for clarity). So this is the output I'm trying to get:
vehicleID | start pos time | end pos time | duration (seconds) | meters travelled |
---|---|---|---|---|
XXXXX | 2021-10-26 06:01:12+00:00 | 2021-10-26 06:25:06+00:00 | 1434 | 2000 |
XXXXX | 2021-10-19 13:20:36+00:00 | 2021-10-19 13:59:29+00:00 | 984 | 75000 |
YYYYY | 2022-09-10 15:14:07+00:00 | 2022-09-10 15:29:39+00:00 | 932 | 8000 |
YYYYY | 2022-08-28 15:16:35+00:00 | 2022-08-28 15:28:43+00:00 | 728 | 90000 |
I feel like a groupby and an agg would be involved by I have no clue how to go about this. Any help would be appreciated! Thanks!
There is probably a more efficient way to code this up, but something like this should work (the new_df has what you're looking for):
NOTE: code below assumes start and end times are in datetime format
df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
4: 'YYYYY'},
'start pos time': {0: '2021-10-26 06:01:12+00:00',
1: '2021-10-19 13:49:09+00:00',
2: '2021-10-19 13:20:36+00:00',
3: '2022-09-10 15:14:07+00:00',
4: '2022-08-28 15:16:35+00:00'},
'end pos time': {0: '2021-10-26 06:25:06+00:00',
1: '2021-10-19 13:59:29+00:00',
2: '2021-10-19 13:26:40+00:00',
3: '2022-09-10 15:29:39+00:00',
4: '2022-08-28 15:28:43+00:00'},
'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
})
# sort dataframe by ID and then start time of trip
df = df.sort_values(by=['vehicleID', 'start pos time'])
# create a new column with the end time of the previous ride
df.loc[:, 'prev end'] = df['end pos time'].shift(1)
# create a new column with the difference between the start time of the current trip and the end time of the prior one
df.loc[:, 'diff'] = df.loc[:, 'start pos time'] - df.loc[:, 'prev end']
# helper function to convert difference between datetime objects to seconds
def get_total_seconds(datetime_delta):
return datetime_delta.total_seconds()
# convert difference column to seconds
df.loc[:, 'diff'] = df['diff'].apply(get_total_seconds)
# where vehicle IDs are the same and the difference between the start time of the current trip and end time of the
# prior trip is less than or equal to 30 minutes, change the start time of the current trip to the start time of the
# prior one
df.loc[((df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)), 'start pos time'] = df['start pos time'].shift(1)
# create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time':'max',
'duration (seconds)':'sum',
'meters travelled':'sum'})
EDIT: if there may be >2 trips that need to be aggregated (as @ouroboros1 pointed out) you can replace everything after the "convert difference column to seconds" code with:
# [based on @ouroboros1 solution] where vehicle IDs are the same and the difference between the start time of the current
# trip and end time of the prior trip is less than or equal to 30 minutes, put trips in the same "group"
df.loc[:, 'group'] = ((df['vehicleID'] != df['vehicleID'].shift(1)) | (df['diff'] > 30*60)).cumsum()
# create a new dataframe, grouped by vehicle ID and group, using the minimum start time and maximum end time for each group
new_df = df.groupby(['vehicleID', 'group'], as_index=False).agg({'start pos time':'min',
'end pos time':'max',
'duration (seconds)':'sum',
'meters travelled':'sum'})