
How to aggregate timestamp data in Spark to smaller time frame


I'm working on a project using New York taxi data. The data contains a pickup location (PULocationID) and a timestamp (tpep_pickup_datetime) for each pick-up record.

[image: sample of the data]

I want to aggregate the data hourly for each location, i.e. produce an hourly count of pick-ups per location.


Solution

  • The question is a bit light on detail, but from what I understood, here are two possible ways to do the aggregation.

    Using date_trunc

    from pyspark.sql import functions as F
    
    # Truncate each timestamp down to the hour, then count pickups per (hour, location)
    df = df.groupBy(
            F.date_trunc('hour', 'tpep_pickup_datetime').alias('hour'),
            'PULocationID',
        ).count()
    
    df.show()
    # +-------------------+------------+-----+
    # |               hour|PULocationID|count|
    # +-------------------+------------+-----+
    # |2020-01-01 00:00:00|         238|    1|
    # |2020-01-01 02:00:00|         238|    2|
    # |2020-01-01 02:00:00|         193|    1|
    # |2020-01-01 01:00:00|         238|    2|
    # |2020-01-01 00:00:00|           7|    1|
    # +-------------------+------------+-----+
    

    Using window

    from pyspark.sql import functions as F
    
    # Bucket timestamps into tumbling 1-hour windows; window() returns a struct<start, end>
    df = df.groupBy(
            F.window('tpep_pickup_datetime', '1 hour').alias('hour'),
            'PULocationID',
        ).count()
    
    df.show(truncate=0)
    # +------------------------------------------+------------+-----+
    # |hour                                      |PULocationID|count|
    # +------------------------------------------+------------+-----+
    # |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|238         |2    |
    # |[2020-01-01 01:00:00, 2020-01-01 02:00:00]|238         |2    |
    # |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|238         |1    |
    # |[2020-01-01 02:00:00, 2020-01-01 03:00:00]|193         |1    |
    # |[2020-01-01 00:00:00, 2020-01-01 01:00:00]|7           |1    |
    # +------------------------------------------+------------+-----+
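
    Note that window() produces a struct column with start and end fields. If you want a flat timestamp column like the date_trunc output, you can select the window's start field. Here is a minimal, self-contained sketch (the sample rows are made up to mimic the question's columns):

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master('local[1]').appName('hourly').getOrCreate()

    # Hypothetical sample data shaped like the taxi dataset's two relevant columns
    df = spark.createDataFrame(
        [('2020-01-01 00:12:00', 238),
         ('2020-01-01 02:05:00', 238),
         ('2020-01-01 02:45:00', 238)],
        ['tpep_pickup_datetime', 'PULocationID'],
    ).withColumn('tpep_pickup_datetime', F.to_timestamp('tpep_pickup_datetime'))

    hourly = (
        df.groupBy(F.window('tpep_pickup_datetime', '1 hour').alias('hour'),
                   'PULocationID')
          .count()
          # flatten the struct: keep only the window's start as the hour column
          .select(F.col('hour.start').alias('hour'), 'PULocationID', 'count')
    )
    hourly.show()
    ```

    This gives the same flat shape as the date_trunc version, so pick whichever reads better; window() is mostly useful when you also need the interval end or non-aligned/sliding windows.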