I'm working on a project using New York taxi data. The data contains records with the pickup location (PULocationID) and the timestamp (tpep_pickup_datetime) of each pick-up.
I want to aggregate the data hourly for each location, so that the result has an hourly count of pick-ups per location.
Your question is a bit light on detail, but from what I understand, these are two possible aggregation options.
Using date_trunc
from pyspark.sql import functions as F
# Truncate each pickup timestamp to the start of its hour,
# then count pick-ups per (hour, location).
df = df.groupBy(
    F.date_trunc('hour', 'tpep_pickup_datetime').alias('hour'),
    'PULocationID',
).count()
df.show()
# +-------------------+------------+-----+
# | hour|PULocationID|count|
# +-------------------+------------+-----+
# |2020-01-01 00:00:00| 238| 1|
# |2020-01-01 02:00:00| 238| 2|
# |2020-01-01 02:00:00| 193| 1|
# |2020-01-01 01:00:00| 238| 2|
# |2020-01-01 00:00:00| 7| 1|
# +-------------------+------------+-----+
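Note that date_trunc gives you a plain timestamp column marking the start of each hour. If you would rather have separate date and hour-of-day columns, a minimal sketch (assuming the aggregated frame from above, with hypothetical column names pickup_date and pickup_hour):
from pyspark.sql import functions as F
# Split the truncated timestamp into a date and an hour-of-day column,
# then sort for readability.
df = (
    df
    .withColumn('pickup_date', F.to_date('hour'))
    .withColumn('pickup_hour', F.hour('hour'))
    .orderBy('pickup_date', 'pickup_hour', 'PULocationID')
)
df.show()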
Using window
from pyspark.sql import functions as F
# Bucket each pickup timestamp into a 1-hour window,
# then count pick-ups per (window, location).
df = df.groupBy(
    F.window('tpep_pickup_datetime', '1 hour').alias('hour'),
    'PULocationID',
).count()
df.show(truncate=0)
# +------------------------------------------+------------+-----+
# |hour                                      |PULocationID|count|
# +------------------------------------------+------------+-----+
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|238         |2    |
# |[2020-01-01 01:00:00, 2020-01-01 02:00:00]|238         |2    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|238         |1    |
# |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|193         |1    |
# |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|7           |1    |
# +------------------------------------------+------------+-----+
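With window, the hour column is a struct holding the window's start and end. If you want a flat timestamp like in the date_trunc version, you can pull out the window start; a minimal sketch (assuming the aggregated frame from above):
from pyspark.sql import functions as F
# Replace the window struct with its start timestamp.
df = df.select(
    F.col('hour.start').alias('hour'),
    'PULocationID',
    'count',
)
df.show()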