I have a dataframe which represents grayscale image
+---+---+-----+
| w| h|color|
+---+---+-----+
| 0| 0|255.0|
| 0| 1|255.0|
| 0| 2|255.0|
| 0| 3|255.0|
| 0| 4|255.0|
| 0| 5|255.0|
| 0| 6|255.0|
| 0| 7|255.0|
| 0| 8|255.0|
| 0| 9|255.0|
| 0| 10|255.0|
| 0| 11|255.0|
| 1| 0|255.0|
| 1| 1|255.0|
| 1| 2|255.0|
| 1| 3|255.0|
| 1| 4|255.0|
| 1| 5|255.0|
| 1| 6|255.0|
| 1| 7|255.0|
+---+---+-----+
top 20 rows
For each row I need to sum "color" when values from "w" and "h" are in range from current value to current value plus a number.
To better understading, possible solution would look like this:
val windowW = Window.rangeBetween(Window.currentRow, Window.currentRow + num1)
val windowH = Window.rangeBetween(Window.currentRow, Window.currentRow + num2)
df.withColumn("color_sum", sum(col("color")).over(col("w").windowW and col("h").windowH))
Could you please give me some hints how to achieve this calculation?
Expected output for the very first row:
+---+---+-----+----------+
| w| h|color|sum(color)|
+---+---+-----+----------+
| 0| 0|255.0| 1020|
+---+---+-----+----------+
Where num1 and num2 are both equals 1.
That means sum is taken from rows:
(0, 0), (0, 1), (1, 0), (1, 1)
For row (1, 1) sum would be taken from rows (1, 1), (1, 2), (2, 1), (2, 2).
You can't use window functions to apply a function on 2-dimensional data with these constraints.
You can use the join method to find the rows:
df.as("df1")
.join(df.as("df2"),
((col("df1.w") - col("df2.w") <= 0) && col("df1.w") - col("df2.w") >= -1) &&
((col("df1.h") - col("df2.h") <= 0) && col("df1.h") - col("df2.h") >= -1),
"inner"
)
.groupBy("df1.w", "df1.h")
.agg(min("df1.color") as "color", sum("df2.color") as "sum")
.orderBy("w", "h")
.show()