I want to assign value of previous id into immediate next id.
For example, "id1" has value is "ab" and "id2" has value "ac".
I want to get the output "id2" has value "ab" "ac".
I have dataframe df as below:
id value1
id1 ab
id1 ab
id2 ac
id2 ac
id3 abc
id3 abc
id3 abc
desired output
id value1 value2
id1 ab
id1 ab
id2 ac ab
id2 ac ab
id3 abc ac
id3 abc ac
id3 abc ac
I used the following script
val w1 = Window.orderBy("id")
val snDF = df.withColumn("value2", lag($"value1", 2).over(w1))
But it gives me:
id value1 value2
id1 ab
id1 ab
id2 ac ab
id2 ac ab
id3 abc ac
id3 abc ac
id3 abc abc
It is not the correct ouput. How can I get it ?
Thanks
Doing the following should work for you
import org.apache.spark.sql.expressions._
val w1 = Window.orderBy("id")
import org.apache.spark.sql.functions._
df.groupBy("id", "value1")
.agg(collect_list("value1").as("temp"))
.withColumn("value2", lag($"value1", 1).over(w1))
.withColumn("temp", explode(col("temp")))
.drop("temp")
.show(false)
You would get dataframe as
+---+------+------+
|id |value1|value2|
+---+------+------+
|id1|ab |null |
|id1|ab |null |
|id2|ac |ab |
|id2|ac |ab |
|id3|abc |ac |
|id3|abc |ac |
|id3|abc |ac |
+---+------+------+