Tags: python, dataframe, pyspark

Find average of value within a range defined in a different table


I'm trying to find the average of the values within a range defined in a different table and store the result in a new column.

Table 1 provides the start and stop times: the start time is the timestamp at which the average calculation should begin, and the stop time is the timestamp at which it should end.

Table 1.

StartTime  StopTime
100        140

Table 2.

Timestamp  Value
80         15.0
90         10.0
100        13.0
110        9.0
120        19.0
130        38.0
140        1.0
150        39.0

Does someone know how to do this in PySpark?

According to tables 1 and 2, the average is (13 + 9 + 19 + 38 + 1) / 5 = 16, and it should be stored in a new column called AverageValue in table 1.

Table 1.

StartTime  StopTime  AverageValue
100        140       16
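
For reproducibility, the two tables can be built like this (a minimal sketch, assuming an active SparkSession named spark; df1 and df2 stand for tables 1 and 2):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table 1: the interval over which to average
df1 = spark.createDataFrame([(100, 140)], ['StartTime', 'StopTime'])

# Table 2: the timestamped values
df2 = spark.createDataFrame(
    [(80, 15.0), (90, 10.0), (100, 13.0), (110, 9.0),
     (120, 19.0), (130, 38.0), (140, 1.0), (150, 39.0)],
    ['Timestamp', 'Value'],
)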

I've tried the following lines, but they did not work.

w = Window.orderBy("Timestamp").rangeBetween(table_1.StartTime, table_1.StopTime)
table_1 = table_1.withColumn('AverageValue', F.avg(table_2.Value).over(w))

Solution

  • The window attempt fails because rangeBetween expects its frame boundaries as offsets relative to the current row, not columns taken from another DataFrame. Instead, join the two DataFrames so that Timestamp in table 2 is contained in the [StartTime, StopTime] interval from table 1, then group the joined DataFrame by the interval and aggregate the Value column:

    from pyspark.sql import functions as F

    # Keep timestamps inside [StartTime, StopTime], then average per interval.
    df3 = df1.join(df2, on=(df1['StartTime'] <= df2['Timestamp']) & (df1['StopTime'] >= df2['Timestamp']), how='left')
    df3 = df3.groupby('StartTime', 'StopTime').agg(F.avg('Value').alias('AverageValue'))
    

    df3.show()
    
    +---------+--------+------------+
    |StartTime|StopTime|AverageValue|
    +---------+--------+------------+
    |      100|     140|        16.0|
    +---------+--------+------------+
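
  • A note on performance: this is a non-equi join, which Spark typically executes as a nested-loop join. If table 1 is small, it can help to broadcast it explicitly, and the condition can be written more compactly with Column.between (inclusive on both ends). A sketch of the same aggregation under those assumptions:

    from pyspark.sql import functions as F

    # Broadcast the small intervals table and use between() for the range test.
    df3 = (
        F.broadcast(df1)
        .join(df2, on=df2['Timestamp'].between(df1['StartTime'], df1['StopTime']), how='left')
        .groupby('StartTime', 'StopTime')
        .agg(F.avg('Value').alias('AverageValue'))
    )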