I would like to create a new column in a pyspark.sql.DataFrame based on lagged values of an existing column. But... I would also like the last values to become the first ones, and the first values to become the last ones. Here is an example :
df = spark.createDataFrame([(1,100),
(2,200),
(3,300),
(4,400),
(5,500)],
['id','value'])
df.show()
+---+-----+
| id|value|
+---+-----+
| 1| 100|
| 2| 200|
| 3| 300|
| 4| 400|
| 5| 500|
+---+-----+
And the desired output would be :
+---+-----+----------------+-----------------+
| id|value|lag_value_plus_2|lag_value_minus_2|
+---+-----+----------------+-----------------+
| 1| 100| 300| 400|
| 2| 200| 400| 500|
| 3| 300| 500| 100|
| 4| 400| 100| 200|
| 5| 500| 200| 300|
+---+-----+----------------+-----------------+
I can feel it has something to do with window functions or pyspark.sql.lag function, but can't figure out how to do.
Here is one solution I can offer. But I'm not sure it is the most optimized one :
from functools import reduce
# Duplicate the dataframe twice, one "before" and one "after"
df = reduce(
lambda a, b : a.union(b),
[df.withColumn("x", F.lit(i)) for i in [-1,0,1]]
)
df.withColumn(
"lag_value_plus_2",
F.lead("value", 2).over(Window.partitionBy().orderBy("x", "id"))
).withColumn(
"lag_value_minus_2",
F.lag("value", 2).over(Window.partitionBy().orderBy("x", "id"))
).where("x=0").drop("x").show()
+---+-----+----------------+-----------------+
| id|value|lag_value_plus_2|lag_value_minus_2|
+---+-----+----------------+-----------------+
| 1| 100| 300| 400|
| 2| 200| 400| 500|
| 3| 300| 500| 100|
| 4| 400| 100| 200|
| 5| 500| 200| 300|
+---+-----+----------------+-----------------+