json, python-3.x, apache-spark, pyspark

How to show empty structs when reading from JSON using PySpark?


Why are my empty structs missing when I'm reading from JSON?

sample.json:

{
  "field_a": "hello",
  "field_b": {}
}

read.py:

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json("sample.json")
df.printSchema()

output:

root
|-- field_a: string

expected output:

root
|-- field_a: string
|-- field_b: struct

I looked at the Spark docs (https://spark.apache.org/docs/latest/sql-data-sources-json.html), and my reading is that setting dropFieldIfAllNull to False should keep the empty struct, but either it doesn't work or I'm misunderstanding what it does.


Solution

  • You can enforce the schema you want by passing it explicitly when reading the JSON file, instead of relying on schema inference. It looks something like this:

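    from pyspark.sql.types import StructType, StructField, StringType
    
    # Define the schema explicitly; an empty StructType() keeps field_b as a struct
    schema = StructType([
        StructField("field_a", StringType(), True),
        StructField("field_b", StructType(), True),
    ])
    
    # sample.json spans several lines, so keep multiLine=True as in the question
    df = spark.read.schema(schema).option("multiLine", True).json("sample.json")
    
    df.printSchema()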
    
    output:
    
    root
     |-- field_a: string (nullable = true)
     |-- field_b: struct (nullable = true)
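
An explicit schema has to name every field that inference leaves out, so it can help to list the keys that hold empty objects before writing it. Below is a minimal sketch using only the Python standard library. `empty_object_paths` is a hypothetical helper name, not part of PySpark, and it assumes the file is small enough to load with `json.loads`.

```python
import json


def empty_object_paths(value, prefix=""):
    """Return dotted paths of every key whose value is an empty JSON object."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(child, dict) and not child:
                # Empty object: this is the kind of field missing from the inferred schema
                paths.append(path)
            else:
                paths.extend(empty_object_paths(child, path))
    elif isinstance(value, list):
        # Array elements share the path of the array itself
        for item in value:
            paths.extend(empty_object_paths(item, prefix))
    return paths


# Same content as sample.json in the question
sample = json.loads('{"field_a": "hello", "field_b": {}}')
print(empty_object_paths(sample))
```

Each path it reports is a candidate for a `StructField(name, StructType(), True)` entry in the explicit schema, at the matching nesting level.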