
PySpark convert empty string to null and write in Parquet


I am trying to convert empty strings to null (None) and then write the result out in Parquet format.

However, the output still contains empty strings rather than null (None). Any ideas what I need to change? I am using Spark 2.4 and Python 3.

Note: I check each column for the string data type before applying the logic below, but I have omitted that check for simplicity of this question.

from pyspark.sql import functions as psf

for column in df.columns:
    df_new = df.withColumn(
        column,
        psf.when(
            psf.length(
                psf.trim(
                    psf.col(column)
                )
            ) != 0,
            psf.trim(psf.col(column))).otherwise(None))

Solution

  • Because you assign the return value of each transformation to df_new, and Spark transformations always return a new DataFrame (DataFrames are immutable structures), every loop iteration starts again from the original df. As a result, df_new only ever holds the change to the last (string) column:

    df = spark.createDataFrame((("a", "b"), ("", "b"), ("a", "")), schema=("A", "B"))
    
    for column in df.columns:
        trimmed = psf.trim(psf.col(column))
        df_new = df.withColumn(
            column,
            psf.when(psf.length(trimmed) != 0, trimmed)
            .otherwise(None))
    df_new.show()
    # +---+----+
    # |  A|   B|
    # +---+----+
    # |  a|   b|
    # |   |   b|
    # |  a|null|
    # +---+----+
    

    However, if you replace the line df_new = df.withColumn( with df = df.withColumn(, you’ll get the desired result:

    for column in df.columns:
        trimmed = psf.trim(psf.col(column))
        df = df.withColumn(
            column,
            psf.when(psf.length(trimmed) != 0, trimmed)
            .otherwise(None))
    
    df.show()
    # +----+----+
    # |   A|   B|
    # +----+----+
    # |   a|   b|
    # |null|   b|
    # |   a|null|
    # +----+----+