
Expanding / Exploding column values with Nulls into multiple rows in Pyspark


I have the following data as per below:

[image: sample input table with columns id, col1, col2, col3 holding comma-separated values and nulls]

I would like to expand this data to get the following data:

[image: expected output, expanded to one row per comma-separated element]

I have tried the following code:

df = (df
  .selectExpr("id", 
              "split(col1, ',') col1", 
              "split(col2, ',') col2",
              "split(col3, ',') col3")
  .withColumn("arr", F.expr("arrays_zip(col1, col2, col3)"))
  .selectExpr("id", "inline(arr)"))
df.show()

This code results in 0 rows, so I tried the 'inline_outer' function instead:

df = (df
  .selectExpr("id", 
              "split(col1, ',') col1", 
              "split(col2, ',') col2",
              "split(col3, ',') col3")
  .withColumn("arr", F.expr("arrays_zip(col1, col2, col3)"))
  .selectExpr("id", "inline_outer(arr)"))
df.show()

This code brings back all the rows, but without any values.

At this stage I am puzzled as to why this happens, because if I expose the original columns (e.g. col1 without splitting), the values are there.

Any suggestions are welcome.


Solution

  • arrays_zip returns null as soon as any of its input arrays is null, and inline on a null array produces no rows. For your case, we need an empty array instead of null, which coalesce(col, array()) provides:

    import pyspark.sql.functions as F

    cnd = "arrays_zip(coalesce(col1,array()),coalesce(col2,array()),coalesce(col3,array()))"
    
    df = df.selectExpr("id","split(col1, ',') col1","split(col2, ',') col2","split(col3, ',') col3").\
    withColumn("arr", F.expr(cnd)).selectExpr("id", "inline_outer(arr)")
    
    df.show()
    
    +---+----+----+----+
    |id |0   |1   |2   |
    +---+----+----+----+
    |x  |1   |a   |c   |
    |x  |2   |b   |c   |
    |x  |3   |c   |c   |
    |y  |null|null|null|
    |z  |12  |null|c   |
    |z  |3   |null|null|
    +---+----+----+----+
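To see why your first attempt returned 0 rows, the null semantics of arrays_zip can be mimicked in plain Python (this is a sketch of the behaviour, not Spark code; the helper names are illustrative):

```python
def arrays_zip(*cols):
    """Mimic Spark's arrays_zip: null if any input array is null,
    otherwise zip to the longest array, padding short ones with null."""
    if any(c is None for c in cols):
        return None  # null in -> null out, so inline() emits no rows
    n = max((len(c) for c in cols), default=0)
    return [tuple(c[i] if i < len(c) else None for c in cols)
            for i in range(n)]

def coalesce_empty(col):
    """Mimic coalesce(col, array()): swap a null array for an empty one."""
    return col if col is not None else []

# Row z from the example: col1 = "12,3", col2 = null, col3 = "c"
cols = (["12", "3"], None, ["c"])

print(arrays_zip(*cols))                               # None
print(arrays_zip(*(coalesce_empty(c) for c in cols)))  # [('12', None, 'c'), ('3', None, None)]
```

With the coalesce applied, the zip pads the missing values with null per element instead of nulling the whole row, which is exactly what inline_outer then expands into the z rows of the output above.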