I have a dataframe with this structure:
Order Number | Line Number | Item | Type |
---|---|---|---|
12345 | 1 | 1001 | Parent |
12345 | 2 | 1002 | Child |
12345 | 3 | 1003 | Child |
12345 | 4 | 1004 | Child |
12345 | 5 | 1005 | Parent |
12345 | 6 | 1006 | Child |
I would like to add a column which shows the "Parent Item" for each row. The parent item is the nearest Parent-type row that precedes each child. There are no explicit relationships or links to use; Line Number alone dictates which children belong to each parent.
Line Number | Item | Type | Parent Item |
---|---|---|---|
1 | 1001 | Parent | 1001 |
2 | 1002 | Child | 1001 |
3 | 1003 | Child | 1001 |
4 | 1004 | Child | 1001 |
5 | 1005 | Parent | 1005 |
6 | 1006 | Child | 1005 |
The parent item number must repeat until a new parent is found. I tried adding a LAG column to do checks but couldn't quite nail down the logic; it felt like I needed more than one helper column and I couldn't make it work.
I've also tried a window function to "group" the rows by row number, partitioning by order number and type, but that doesn't work because it separates the parents from the children.
Try this: use `last` with `ignorenulls=True` over a window ordered by `Line Number`. Each Parent row contributes its own `Item`, Child rows contribute null, and `last` carries the most recent non-null value forward until the next parent appears:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Within each order, walk the rows in line-number order and take the
# last non-null Parent Item seen so far.
w = Window.partitionBy("Order Number").orderBy("Line Number")

df = df.withColumn(
    "Parent_Item",
    F.last(
        F.when(F.col("Type") == "Parent", F.col("Item")),  # null on Child rows
        ignorenulls=True,
    ).over(w),
)
df.show()
Output:
+------------+-----------+----+------+-----------+
|Order Number|Line Number|Item| Type|Parent_Item|
+------------+-----------+----+------+-----------+
| 12345| 1|1001|Parent| 1001|
| 12345| 2|1002| Child| 1001|
| 12345| 3|1003| Child| 1001|
| 12345| 4|1004| Child| 1001|
| 12345| 5|1005|Parent| 1005|
| 12345| 6|1006| Child| 1005|
+------------+-----------+----+------+-----------+
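For reference, the forward-fill logic that `last(..., ignorenulls=True)` performs can be sketched in plain Python (no Spark needed) to sanity-check the expected output; the tuples below just mirror the sample data from the question:

```python
def fill_parent(rows):
    """Carry the most recent Parent item forward onto each row.

    rows: list of (line_number, item, type) tuples, assumed already
    sorted by line number within a single order.
    """
    result = []
    current_parent = None
    for line, item, typ in rows:
        if typ == "Parent":
            current_parent = item  # a new parent resets the fill value
        result.append((line, item, typ, current_parent))
    return result

rows = [
    (1, 1001, "Parent"),
    (2, 1002, "Child"),
    (3, 1003, "Child"),
    (4, 1004, "Child"),
    (5, 1005, "Parent"),
    (6, 1006, "Child"),
]
for line, item, typ, parent in fill_parent(rows):
    print(line, item, typ, parent)
```

The last column printed matches the `Parent_Item` column in the Spark output above, which is a quick way to convince yourself the window-function version is doing the right thing.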