Tags: dataframe, apache-spark, pyspark, average

How can I find the average of every n rows in PySpark?


I have 1440 rows in my dataframe (one row for every minute of the day). I want to aggregate this into hours so that I am left with 24 rows in total.

This is a 2-column dataframe: the first column is the minute and the second column is an integer value. I would like a 24 x 2 dataframe where the first column is the hour and the second column is the average of the corresponding 60 values.


Solution

  • If your minutes column is an integer starting at 0, something along these lines should work:

    from pyspark.sql import functions as F

    hour = F.floor(F.col('minute') / 60).alias('hour')
    df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
    

    Here is a toy example where I assume that every hour has only 3 minutes:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [(0, 5),
         (1, 5),
         (2, 5),
         (3, 0),
         (4, 0),
         (5, 1)],
        ['minute', 'integer'])
    
    hour = F.floor(F.col('minute') / 3).alias('hour')
    df = df.groupBy(hour).agg(F.avg('integer').alias('average'))
    
    df.show()
    # +----+------------------+
    # |hour|           average|
    # +----+------------------+
    # |   0|               5.0|
    # |   1|0.3333333333333333|
    # +----+------------------+
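
    Applied to the original 1440-minute dataframe, the same pattern with a divisor of 60 produces the 24 hourly averages. A minimal sketch, assuming an active SparkSession named `spark` and made-up placeholder values for the integer column:

    from pyspark.sql import functions as F

    # Hypothetical full-day data: minutes 0..1439 with placeholder integer values
    df = spark.range(0, 1440).select(
        F.col('id').alias('minute'),
        (F.col('id') % 7).alias('integer'))

    # Bucket each minute into its hour and average the 60 values per bucket
    hour = F.floor(F.col('minute') / 60).alias('hour')
    hourly = df.groupBy(hour).agg(F.avg('integer').alias('average')).orderBy('hour')

    hourly.count()
    # 24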