Say you have a dataframe that looks like the following:
df = spark.createDataFrame([
    ('test1', 7, ['', 'hi', ''], ['', '', '0']),
    ('', 14, ['', '', '6', ''], ['98', '0', '', '9'])
], ["a", "b", "c", "d"])
df.show()
Column A | Column B | Column C | Column D |
---|---|---|---|
'test1' | 7 | ['', 'hi', ''] | ['', '', '0'] |
'' | 14 | ['', '', '6', ''] | ['98', '0', '', '9'] |
Is there a way to replace all of the empty strings inside the array columns (C and D) with None/null? It would end up looking like the following:
Column A | Column B | Column C | Column D |
---|---|---|---|
'test1' | 7 | [Null, 'hi', Null] | [Null, Null, '0'] |
Null | 14 | [Null, Null, '6', Null] | ['98', '0', Null, '9'] |
The key is that I need to keep the position of each value in the arrays; I just want Null values in place of the empty strings.
I have been able to convert the empty strings in the non-array columns to None with the following code:
from pyspark.sql import functions as F

df = df.select([F.when(F.col(c) == "", None).otherwise(F.col(c)).alias(c)
                if t == "string" else F.col(c)
                for c, t in df.dtypes])
I have looked at the array functions in the Databricks documentation here: https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html
With array_remove() I could strip all of the empty strings out of the arrays, but the challenge remains: I need to keep each element's position, so I can't just remove the empty strings without putting a Null in their place. Is there a way to do this?
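For example, a quick sketch against the dataframe above shows the problem: array_remove() drops the empty strings, but the surviving elements shift left and the original positions are gone.
from pyspark.sql import functions as F

# array_remove drops every '' but compacts the array,
# so ['', 'hi', ''] collapses to ['hi'] and the positions are lost
df.select(F.array_remove(F.col("c"), "").alias("c")).show()
+----+
|   c|
+----+
|[hi]|
| [6]|
+----+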
You can use the transform function to apply a Spark expression to every element of an array:
df = spark.createDataFrame([
    (1, ["a", "", "b"]),
    (2, ["", "", "c"])
], ["id", "list"])
df.show()
+---+--------+
| id| list|
+---+--------+
| 1|[a, , b]|
| 2| [, , c]|
+---+--------+
from pyspark.sql import functions as F

result = df.select(
    F.transform(F.col("list"), lambda x: F.when(x == "", F.lit(None)).otherwise(x)).alias("list")
)
result.show()
+---------------+
| list|
+---------------+
| [a, null, b]|
|[null, null, c]|
+---------------+
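Applied back to the dataframe in your question, a sketch along the same lines (assuming the column names a, b, c, d and string-typed arrays, as above) nulls out the empty strings in both the scalar and the array columns in one pass:
from pyspark.sql import functions as F

def blank_to_null(x):
    # turn an empty string into a null, leave everything else alone
    return F.when(x == "", F.lit(None)).otherwise(x)

result = df.select([
    F.transform(F.col(c), blank_to_null).alias(c) if t == "array<string>"  # arrays: element-wise
    else blank_to_null(F.col(c)).alias(c) if t == "string"                 # plain string columns
    else F.col(c)                                                          # leave other types alone
    for c, t in df.dtypes
])
result.show()
This gives the Null placement you described while keeping every element in its original position. Note that passing a Python lambda to F.transform requires Spark 3.1+; on older versions you can get the same effect through the SQL form, e.g. F.expr("transform(c, x -> nullif(x, ''))").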