I'm working on a project using New York taxi data. The data contains records with the pickup location (PULocationID) and the timestamp (tpep_pickup_datetime) of each pick-up.
I want to aggregate the data hourly for each location, so that the result has an hourly count of pick-ups per location.
Your question is a bit light on detail, but from what I understand, these are two possible aggregation options.
Using date_trunc
from pyspark.sql import functions as F
# Truncate each pickup timestamp to the start of its hour,
# then count pick-ups per (hour, location).
df = df.groupBy(
    F.date_trunc('hour', 'tpep_pickup_datetime').alias('hour'),
    'PULocationID',
).count()
df.show()
# +-------------------+------------+-----+
# | hour|PULocationID|count|
# +-------------------+------------+-----+
# |2020-01-01 00:00:00| 238| 1|
# |2020-01-01 02:00:00| 238| 2|
# |2020-01-01 02:00:00| 193| 1|
# |2020-01-01 01:00:00| 238| 2|
# |2020-01-01 00:00:00| 7| 1|
# +-------------------+------------+-----+
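Note that date_trunc gives you a plain timestamp column marking the start of each hour. If you would rather have separate date and hour-of-day columns, a minimal sketch (assuming the aggregated frame from above, with hypothetical column names pickup_date and pickup_hour):
from pyspark.sql import functions as F
# Split the truncated timestamp into a date and an hour-of-day column,
# then sort for readability.
df = (
    df
    .withColumn('pickup_date', F.to_date('hour'))
    .withColumn('pickup_hour', F.hour('hour'))
    .orderBy('pickup_date', 'pickup_hour', 'PULocationID')
)
df.show()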
Using window
from pyspark.sql import functions as F
# Bucket each pickup timestamp into a 1-hour window,
# then count pick-ups per (window, location).
df = df.groupBy(
    F.window('tpep_pickup_datetime', '1 hour').alias('hour'),
    'PULocationID',
).count()
df.show(truncate=0)
# +------------------------------------------+------------+-----+
# |hour                                      |PULocationID|count|
# +------------------------------------------+------------+-----+
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|238         |2    |
# |[2020-01-01 01:00:00, 2020-01-01 02:00:00]|238         |2    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|238         |1    |
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|193         |1    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|7           |1    |
# +------------------------------------------+------------+-----+
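With window, the hour column is a struct holding the window's start and end. If you want a flat timestamp like in the date_trunc version, you can pull out the window start; a minimal sketch (assuming the aggregated frame from above):
from pyspark.sql import functions as F
# Replace the window struct with its start timestamp.
df = df.select(
    F.col('hour.start').alias('hour'),
    'PULocationID',
    'count',
)
df.show()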