python arrays apache-spark pyspark databricks

Replace empty strings with None inside a column of Array type in a Spark Dataframe

Say you have a dataframe that looks like the following:

df = spark.createDataFrame([
    ('test1', 7,['','hi',''], ['','',0]),
    ('', 14, ['','',6,''],[98,0,'',9])
], ["a", "b","c","d"])
df.show()

Column A	Column B	Column C	Column D
'test1'	7	['','hi','']	['','',0]
''	14	['','',6,'']	[98,0,'',9]

Is there a way to replace all of the empty strings inside the array columns (C and D) with None/null? It would end up looking like the following:

Column A	Column B	Column C	Column D
'test1'	7	[Null,'hi',Null]	[Null,Null,0]
Null	14	[Null,Null,6,Null]	[98,0,Null,9]

They key is I need to keep the positional value in each array, but I would like to have Null values instead of empty strings inside the arrays.

I have been able to turn the columns that are not arrays to None with the following code:

from pyspark.sql import functions as F
df=df.select([F.when(F.col(c)=="",None).otherwise(F.col(c)).alias(c) for c in df.columns])

I have looked at the array functions in databricks documentation here: https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html

Using array_remove() I could remove all of the empty strings within the arrays, but again the challenge being I need to keep the positional values of each array with a Null value. I cannot just remove the strings without replacing the value. Is there a way to do this?

Solution

You can simply use the transform function to apply a spark transformation to every element of an array:

df = spark.createDataFrame([
    (1, ["a", "", "b"]),
    (2, ["", "", "c"])
], ["id", "list"])
df.show()

+---+--------+
| id|    list|
+---+--------+
|  1|[a, , b]|
|  2| [, , c]|
+---+--------+

from pyspark.sql import functions as F
result = df.select(F.transform(F.col("list"),
    lambda x: F.when(x == "", F.lit(None)).otherwise(x)
  ).alias("list"))
result.show()

+---------------+                                                               
|           list|
+---------------+
|   [a, null, b]|
|[null, null, c]|
+---------------+