I'm trying to compute the average of a value over a time range defined in a different table, and store the result in a new column.
Table 1 provides the start and stop times: StartTime is the timestamp where the average calculation should start, and StopTime is the timestamp where it should stop.
Table 1.

| StartTime | StopTime |
|---|---|
| 100 | 140 |
Table 2.

| Timestamp | Value |
|---|---|
| 80 | 15 |
| 90 | 10 |
| 100 | 13 |
| 110 | 9 |
| 120 | 19 |
| 130 | 38 |
| 140 | 1 |
| 150 | 39 |
Does anyone know how to do this in PySpark?
According to tables 1 and 2, the average value is (13 + 9 + 19 + 38 + 1) / 5 = 16, and it should be stored in a new column called AverageValue in table 1:
Table 1.

| StartTime | StopTime | AverageValue |
|---|---|---|
| 100 | 140 | 16 |
I've tried the following lines, but they don't work:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Fails: rangeBetween expects offsets relative to the current row,
# not Column references, and table_1 cannot reference table_2's columns.
w = Window.orderBy("Timestamp").rangeBetween(table_1.StartTime, table_1.StopTime)
table_1 = table_1.withColumn('AverageValue', F.avg(table_2.Value).over(w))
```
Join the two DataFrames so that Timestamp in table 2 falls within the [StartTime, StopTime] interval from table 1, then group the joined DataFrame by the interval and aggregate the Value column.
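For reference, a minimal sketch that builds the two sample DataFrames; the names spark, df1, and df2 are assumptions (an existing SparkSession, table 1, and table 2 respectively):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Table 1: the intervals over which to average.
df1 = spark.createDataFrame([(100, 140)], ['StartTime', 'StopTime'])

# Table 2: the timestamped values.
df2 = spark.createDataFrame(
    [(80, 15), (90, 10), (100, 13), (110, 9),
     (120, 19), (130, 38), (140, 1), (150, 39)],
    ['Timestamp', 'Value'],
)
```

With that in place, the join-and-aggregate step looks like this: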
```python
# Left join: keep every interval from table 1, matched with all rows of
# table 2 whose Timestamp lies inside [StartTime, StopTime].
df3 = df1.join(
    df2,
    on=(df1['StartTime'] <= df2['Timestamp']) & (df1['StopTime'] >= df2['Timestamp']),
    how='left',
)

# Collapse each interval back to one row and average the matched values.
df3 = df3.groupby('StartTime', 'StopTime').agg(F.avg('Value').alias('AverageValue'))
df3.show()
```
```
+---------+--------+------------+
|StartTime|StopTime|AverageValue|
+---------+--------+------------+
|      100|     140|        16.0|
+---------+--------+------------+
```
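The same condition reads a bit more cleanly with Column.between (inclusive on both ends), and if table 1 is small, a broadcast hint avoids shuffling the larger table; a sketch under those assumptions:

```python
from pyspark.sql import functions as F

df3 = (
    F.broadcast(df1)  # hint: table 1 is assumed small enough to broadcast
    .join(df2, on=df2['Timestamp'].between(df1['StartTime'], df1['StopTime']), how='left')
    .groupby('StartTime', 'StopTime')
    .agg(F.avg('Value').alias('AverageValue'))
)
df3.show()
```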