I'm trying to calculate the differential between the first Sent
date/time in an ID and the last Received
date/time in an ID, grouping them by Source and Destination. Sample (named test_subset
) looks like this (but it is '000s of rows):
| ID | From | To | leg_sent | leg_received | Source | Destination |
|:-----|:-----|:-----|:---------|:-------------|:-------|:------------|
| 1btr | ABC | XYZ | 08:22:23 | 08:22:41 | GB | FR |
| 1btr | XYZ | DEF | 08:22:49 | 08:23:05 | GB | FR |
| 2vyu | LMN | JFK | 14:35:11 | 14:35:23 | US | DE |
| 2vyu | JFK | HIJ | 14:35:35 | 14:35:48 | US | DE |
| 2vyu | HIJ | TPQ | 14:35:51 | 14:36:25 | US | DE |
In the data ID
is a transfer ID with the From
and To
rows being each of the legs of the transfer. In some instances their may be only 2 legs in others there may be as many as 10; what has to be calculated is the difference between first sent and last received, which won't be on the same row. The Source
and Destination
would remain the same across all rows relating to the same ID.
I currently have a piece of code (below) that works for me to calculate the differential between leg_sent
and leg_received
on the same row and grouping it by the From and To codes:
pairs = (
test_subset
.withColumn('"leg_time"', F.datediff('seconds', F.col('"leg_sent"'), F.col('"leg_received"')))
.groupBy(['"From"', '"To"'])
.agg(
F.count('*').alias('"num_paths"'),
F.avg('"leg_time"').alias('"average_leg_time"'),
F.min('"leg_time"').alias('"min_leg_time"'),
F.max('"leg_time"').alias('"max_leg_time"')
)
.withColumn('"leg_range"', F.col('"max_leg_time"') - F.col('"min_leg_time"'))
)
I need to calculate the full time differential by calculating the difference between first leg_sent
and last leg_received
for the same ID and grouping by Source and Destination. Ideally I want to modify the same code block above to suit this.
Thanks!
If I understand correctly to get the full time for the same id, source and destination, I think you can group by ID
, Source
and Destination
and get the MIN(leg_sent)
and MAX(leg_received)
and then calculate the difference
pairs = (
test_subset
.groupBy(['"ID"', '"Source"', '"Destination"'])
.agg(
F.min('"leg_sent"').alias('"min_leg_sent"'),
F.max('"leg_received"').alias('"max_leg_received"')
)
.withColumn('"leg_full_time"', F.datediff('seconds', F.col('"min_leg_sent"'), F.col('"max_leg_received"')))
)