Search code examples
pythondataframeapache-sparkpysparkapache-spark-sql

Pyspark - Repeat value until change in column


I have a dataframe with this structure

Order Number Line Number Item Type
12345 1 1001 Parent
12345 2 1002 Child
12345 3 1003 Child
12345 4 1004 Child
12345 5 1005 Parent
12345 6 1006 Child

I would like to add a column which shows the "Parent Item" for each item. The parent item is the first parent type that each child follows. There are no relationships or links to use. Line Number dictates the children for each parent.

Line Number Item Type Parent Item
1 1001 Parent 1001
2 1002 Child 1001
3 1003 Child 1001
4 1004 Child 1001
5 1005 Parent 1005
6 1006 Child 1005

The parent item number must repeat until a new parent is found. I have tried adding a LAG column to do checks but couldn't quite nail down the logic. I felt like I needed more than one column but couldn't do it.

I've also tried a window function to "group" them together by row number, partitioning by order number and type but that doesn't work as it separates the parents from the children.


Solution

  • Try this:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    
    df = df.withColumn(
        "Parent_Item",
        F.last(F.when(F.col("Type") == "Parent", F.col("Item")), ignorenulls=True).over(
            Window.partitionBy("Order Number").orderBy("Line Number")
        ),
    )
    
    df.show()
    

    Output:

    +------------+-----------+----+------+-----------+
    |Order Number|Line Number|Item|  Type|Parent_Item|
    +------------+-----------+----+------+-----------+
    |       12345|          1|1001|Parent|       1001|
    |       12345|          2|1002| Child|       1001|
    |       12345|          3|1003| Child|       1001|
    |       12345|          4|1004| Child|       1001|
    |       12345|          5|1005|Parent|       1005|
    |       12345|          6|1006| Child|       1005|
    +------------+-----------+----+------+-----------+