python dataframe apache-spark pyspark apache-spark-sql

Pyspark - Repeat value until change in column

I have a dataframe with this structure

Order Number	Line Number	Item	Type
12345	1	1001	Parent
12345	2	1002	Child
12345	3	1003	Child
12345	4	1004	Child
12345	5	1005	Parent
12345	6	1006	Child

I would like to add a column which shows the "Parent Item" for each item. The parent item is the first parent type that each child follows. There are no relationships or links to use. Line Number dictates the children for each parent.

Line Number	Item	Type	Parent Item
1	1001	Parent	1001
2	1002	Child	1001
3	1003	Child	1001
4	1004	Child	1001
5	1005	Parent	1005
6	1006	Child	1005

The parent item number must repeat until a new parent is found. I have tried adding a LAG column to do checks but couldn't quite nail down the logic. I felt like I needed more than one column but couldn't do it.

I've also tried a window function to "group" them together by row number, partitioning by order number and type but that doesn't work as it separates the parents from the children.

Solution

Try this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn(
    "Parent_Item",
    F.last(F.when(F.col("Type") == "Parent", F.col("Item")), ignorenulls=True).over(
        Window.partitionBy("Order Number").orderBy("Line Number")
    ),
)

df.show()

Output:

+------------+-----------+----+------+-----------+
|Order Number|Line Number|Item|  Type|Parent_Item|
+------------+-----------+----+------+-----------+
|       12345|          1|1001|Parent|       1001|
|       12345|          2|1002| Child|       1001|
|       12345|          3|1003| Child|       1001|
|       12345|          4|1004| Child|       1001|
|       12345|          5|1005|Parent|       1005|
|       12345|          6|1006| Child|       1005|
+------------+-----------+----+------+-----------+