
PySpark - how to replace null array in JSON file


I have JSON files with null/empty arrays (array (nullable = true), element: (containsNull = true)) that I'd like to convert to Parquet files. These null fields are dropped automatically during conversion, while all other columns are converted as expected. Is there a way to replace the null arrays with something else (e.g. ["-"])? I'm running my code in AWS Glue, but the replacement itself would be done with pure PySpark and DataFrames.

The data type must still be an array in the parquet file (just not with null values).

JSON:

{
 "a":[],
 "b":[1,2],
 "c": "a string"
}

Expected output:

   a  |   b   |     c
["-"] | [1,2] | "a string"

Current script:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputDF = glueContext.create_dynamic_frame_from_options(connection_type="s3",
                                                        connection_options={
                                                             "paths": ["s3://my-bucket"],
                                                             "recurse": True
                                                           },
                                                        format="json")
df = inputDF.toDF()
# replace empty arrays here
dynamic = DynamicFrame.fromDF(df, glueContext, "dynamic")

output = glueContext.write_dynamic_frame.from_options(frame=dynamic,
                                                      connection_type="s3",
                                                      connection_options={
                                                           "path":"s3://my-bucket"
                                                           },
                                                      format="parquet"
                                                      )

Solution

  • I'm not sure which Spark version you are using; I checked with Spark 3.1.2 and 2.4.5, and empty array fields are not ignored in either.

    You can use the line below to get the desired result:

    from pyspark.sql.functions import when, size, array, lit, col
    df.withColumn('a', when(size('a') == 0, array(lit('-'))).otherwise(col('a'))).show()
    
    +---+------+--------+
    |  a|     b|       c|
    +---+------+--------+
    |[-]|[1, 2]|a string|
    +---+------+--------+
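
    If you want to apply the same replacement to every array column instead of just a (for example at the # replace empty arrays here step, before converting back to a DynamicFrame), a minimal sketch along these lines should work. fill_empty_arrays and the placeholder argument are just illustrative names, not part of your script, and the sketch only touches array<string> columns, since a "-" placeholder would not type-check against columns like b (array<bigint>):

    from pyspark.sql.functions import when, size, array, lit, col
    from pyspark.sql.types import ArrayType, StringType

    # Illustrative helper: replace null or empty string arrays with [placeholder]
    # in every array<string> column found in the DataFrame schema.
    def fill_empty_arrays(df, placeholder="-"):
        for field in df.schema.fields:
            if isinstance(field.dataType, ArrayType) and isinstance(field.dataType.elementType, StringType):
                df = df.withColumn(
                    field.name,
                    when(col(field.name).isNull() | (size(col(field.name)) == 0),
                         array(lit(placeholder))
                    ).otherwise(col(field.name))
                )
        return df

    df = fill_empty_arrays(df)  # then DynamicFrame.fromDF(df, glueContext, "dynamic") as before

    For arrays of other element types you would need to cast the placeholder to the matching element type, otherwise the when/otherwise branches won't agree on a common array type.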