I am trying to convert empty strings to Null (None) and then write out in Parquet format.
However, the output is still an empty string and not Null (None). Any ideas what I need to change? I am using Spark 2.4 and Python 3.
Note: I am checking columns for String data type before applying the below, but I have omitted that check for the simplicity of this question.
from pyspark.sql import functions as psf

for column in df.columns:
    df_new = df.withColumn(
        column,
        psf.when(
            psf.length(psf.trim(psf.col(column))) != 0,
            psf.trim(psf.col(column))
        ).otherwise(None)
    )
Because you assign the result to df_new while still reading from the original df on every iteration, and Spark transformations return a new DataFrame (DataFrames are immutable structures), each pass through the loop discards the previous one's work: only the last (string) column ends up changed.
df = spark.createDataFrame((("a", "b"), ("", "b"), ("a", "")), schema=("A", "B"))

for column in df.columns:
    trimmed = psf.trim(psf.col(column))
    df_new = df.withColumn(
        column,
        psf.when(psf.length(trimmed) != 0, trimmed).otherwise(None)
    )

df_new.show()
# +---+----+
# |  A|   B|
# +---+----+
# |  a|   b|
# |   |   b|
# |  a|null|
# +---+----+
However, if you change the line df_new = df.withColumn( to df = df.withColumn(, so that each iteration builds on the previous one's result, you'll get the desired output:
for column in df.columns:
    trimmed = psf.trim(psf.col(column))
    df = df.withColumn(
        column,
        psf.when(psf.length(trimmed) != 0, trimmed).otherwise(None)
    )

df.show()
# +----+----+
# | A| B|
# +----+----+
# | a| b|
# |null| b|
# | a|null|
# +----+----+