I have the following data as per below:
I would like to expand this data to get the following data:
I have tried the following code:
df = (df
.selectExpr("id",
"split(col1, ',') col1",
"split(col2, ',') col2",
"split(col3, ',') col3")
.withColumn("arr", F.expr("arrays_zip(col1, col2, col3)"))
.selectExpr("id", "inline(arr)"))
df.show()
This code result in 0 rows. So I have tried to use the command 'inline_outer':
df = (df
.selectExpr("id",
"split(col1, ',') col1",
"split(col2, ',') col2",
"split(col3, ',') col3")
.withColumn("arr", F.expr("arrays_zip(col1, col2, col3)"))
.selectExpr("id", "inline_outer(arr)"))
df.show()
This code brings back all the rows but without any value.
I am at this stage puzzled as to why this happens as if I expose the original column (e.g. col1 without splitting, etc.) the values are there.
Any suggestions are welcome.
using arrays_zip
with null
value returns null
. For your case, we need empty array instead of null.
cnd = "arrays_zip(coalesce(col1,array()),coalesce(col2,array()),coalesce(col3,array()))"
df = df.selectExpr("id","split(col1, ',') col1","split(col2, ',') col2","split(col3, ',') col3").\
withColumn("arr", F.expr(cnd)).selectExpr("id", "inline_outer(arr)")
df.show()
+---+----+----+----+
|id |0 |1 |2 |
+---+----+----+----+
|x |1 |a |c |
|x |2 |b |c |
|x |3 |c |c |
|y |null|null|null|
|z |12 |null|c |
|z |3 |null|null|
+---+----+----+----+