I have the following dummy data:
data = {"trip_id": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"],
"distance": [5, 10, 2, 12, 20, 1, 15, 2, 3, 5, 0, 28],
"value": [32, 46, 132, 45, 68, 123, 56, 54, 32, 89, 10, 39],
"timestamp": ["2021-05-09 10:12:18", "2021-06-19 08:22:18", "2021-01-09 15:20:00", "2021-01-09 18:30:18",
"2021-04-09 24:11:18", "2021-02-19 08:12:18", "2021-06-11 08:56:18", "2021-06-11 08:54:18",
"2021-06-09 18:12:18", "2021-06-19 16:10:18", "2021-06-03 18:12:18", "2021-06-12 08:22:18"]
}
df = pd.DataFrame(data, columns = ["trip_id", "distance", "value", "timestamp"])
df
>>
trip_id distance value timestamp
0 a 5 32 2021-05-09 10:12:18
1 a 10 46 2021-06-19 08:22:18
2 a 2 132 2021-01-09 15:20:00
3 a 12 45 2021-01-09 18:30:18
4 b 20 68 2021-04-09 24:11:18
5 b 1 123 2021-02-19 08:12:18
6 b 15 56 2021-06-11 08:56:18
7 c 2 54 2021-06-11 08:54:18
8 c 3 32 2021-06-09 18:12:18
9 c 5 89 2021-06-19 16:10:18
10 c 0 10 2021-06-03 18:12:18
11 c 28 39 2021-06-12 08:22:18
I would like to compact the values per trip_id for every 30m of accumulated distance. For this, I am trying:
current_trip_id = ""
dist_segment = []
for index, row in df.iterrows():
    if row["trip_id"] == current_trip_id:
        if sum(dist_segment) <= 30:
            # compute trip characteristics every 30m
            sorted_trip_char = df.sort_values(["trip_id", "timestamp"])
            grouped_trip_char = sorted_trip_char.groupby("trip_id")
            agg_trip_char = grouped_trip_char.agg({'trip_id': 'first',
                                                   'distance': 'sum',
                                                   'value': 'mean',
                                                   'timestamp': ['first', 'last']
                                                   })
        else:
            # store trip characteristics to another line up to 30m
            print("More than 30m in index", index)
    # Update loop
    current_trip_id = row["trip_id"]
    dist_segment.append(row["distance"])
However, I can't see how I can preserve the trip_id and start a new aggregation of values once the 30m segment is exceeded.
This is the output I'm after:
agg_trip_char
>>
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-01-09 15:20:00 2021-06-19 08:22:18
b b 21 95.50 2021-02-19 08:12:18 2021-06-11 08:56:18
b b 15 56 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 10 46.25 2021-06-11 08:54:18 2021-06-03 18:12:18
c c 28 39 2021-06-12 08:22:18 2021-06-12 08:22:18
You can modify your code as follows:
df_out = df.sort_values(["trip_id", "timestamp"])
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
Here, we get the cumulative sum of distance within a particular trip_id by using .groupby() + .cumsum(). Then, integer division by 30 puts every 30m of distance into a different group number.
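If you want to inspect the intermediate grouping key, here is a minimal sketch (the running_distance and segment column names are only for illustration, not part of the solution):
# Sketch: show the running distance per trip_id next to the segment
# number that the integer division by 30 produces.
inspect = df_out.assign(
    running_distance=df_out.groupby("trip_id")["distance"].cumsum(),
    segment=g,
)
print(inspect[["trip_id", "distance", "running_distance", "segment"]])
Then pass g as the second grouping key: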
df_out2 = (df_out.groupby(['trip_id', g])        # group by trip_id and the 30m segment number
                 .agg({'trip_id': 'first',
                       'distance': 'sum',
                       'value': 'mean',
                       'timestamp': ['first', 'last']
                       })
                 .droplevel(-1)                  # drop the segment number level from the index
          )
Result:
print(df_out2)
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-01-09 15:20:00 2021-06-19 08:22:18
b b 21 95.50 2021-02-19 08:12:18 2021-04-09 24:11:18
b b 15 56.00 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 5 32.00 2021-06-03 18:12:18 2021-06-11 08:54:18
c c 33 64.00 2021-06-12 08:22:18 2021-06-19 16:10:18
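As a side note, .agg() with a dict that includes a list of functions returns a two-level column header. If you prefer flat column names, you could join the levels yourself; a sketch, assuming the underscore naming suits you:
# Sketch: flatten the MultiIndex columns into single names such as
# "distance_sum", "value_mean" or "timestamp_first".
df_out2.columns = ["_".join(col) for col in df_out2.columns]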
If we don't sort the rows, the result will look more like your expected output:
#df_out = df.sort_values(["trip_id", "timestamp"])
df_out = df
g = df_out.groupby('trip_id')['distance'].cumsum() // 30
df_out2 = (df_out.groupby(['trip_id', g])
.agg({'trip_id': 'first',
'distance':'sum',
'value': 'mean',
'timestamp':['first', 'last']
}).droplevel(-1)
)
Result:
print(df_out2)
trip_id distance value timestamp
first sum mean first last
trip_id
a a 29 63.75 2021-05-09 10:12:18 2021-01-09 18:30:18
b b 21 95.50 2021-04-09 24:11:18 2021-02-19 08:12:18
b b 15 56.00 2021-06-11 08:56:18 2021-06-11 08:56:18
c c 10 46.25 2021-06-11 08:54:18 2021-06-03 18:12:18
c c 28 39.00 2021-06-12 08:22:18 2021-06-12 08:22:18
However, it seems reasonable to sort by trip_id and timestamp if you need to follow the time sequence. Hence, please review your real-life situation and choose the option that fits it.
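One more note: in this dummy data the timestamps are plain strings, so the sort is lexicographic (which happens to give the right order for this YYYY-MM-DD format). If you need a truly chronological sort on your real data, you may want to parse the column first; a sketch, assuming string timestamps:
# Sketch: parse the strings into real datetimes before sorting.
# errors="coerce" turns unparseable values (such as the "24:11:18" hour
# in the dummy data) into NaT instead of raising an error.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df_out = df.sort_values(["trip_id", "timestamp"])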