Search code examples
pythonarraysapache-sparkpysparkdatabricks

Replace empty strings with None inside a column of Array type in a Spark Dataframe


Say you have a dataframe that looks like the following:

df = spark.createDataFrame([
    ('test1', 7,['','hi',''], ['','',0]),
    ('', 14, ['','',6,''],[98,0,'',9])
], ["a", "b","c","d"])
df.show()
Column A Column B Column C Column D
'test1' 7 ['','hi',''] ['','',0]
'' 14 ['','',6,''] [98,0,'',9]

Is there a way to replace all of the empty strings inside the array columns (C and D) with None/null? It would end up looking like the following:

Column A Column B Column C Column D
'test1' 7 [Null,'hi',Null] [Null,Null,0]
Null 14 [Null,Null,6,Null] [98,0,Null,9]

They key is I need to keep the positional value in each array, but I would like to have Null values instead of empty strings inside the arrays.

I have been able to turn the columns that are not arrays to None with the following code:

from pyspark.sql import functions as F
df=df.select([F.when(F.col(c)=="",None).otherwise(F.col(c)).alias(c) for c in df.columns])

I have looked at the array functions in databricks documentation here: https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html

Using array_remove() I could remove all of the empty strings within the arrays, but again the challenge being I need to keep the positional values of each array with a Null value. I cannot just remove the strings without replacing the value. Is there a way to do this?


Solution

  • You can simply use the transform function to apply a spark transformation to every element of an array:

    df = spark.createDataFrame([
        (1, ["a", "", "b"]),
        (2, ["", "", "c"])
    ], ["id", "list"])
    df.show()
    
    +---+--------+
    | id|    list|
    +---+--------+
    |  1|[a, , b]|
    |  2| [, , c]|
    +---+--------+
    
    from pyspark.sql import functions as F
    result = df.select(F.transform(F.col("list"),
        lambda x: F.when(x == "", F.lit(None)).otherwise(x)
      ).alias("list"))
    result.show()
    
    +---------------+                                                               
    |           list|
    +---------------+
    |   [a, null, b]|
    |[null, null, c]|
    +---------------+