How to show empty structs when reading from JSON using PySpark?

Why are my empty structs missing when I'm reading from JSON?

sample.json:

{
  "field_a": "hello",
  "field_b": {}
}

read.py:

df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json("sample.json")
df.printSchema()

output:

root
|-- field_a: string

expected output:

root
|-- field_a: string
|-- field_b: struct

I looked at the Spark docs (https://spark.apache.org/docs/latest/sql-data-sources-json.html). They describe dropFieldIfAllNull as the option that controls whether empty structs are dropped, so I expected setting it to False to keep field_b. Either it doesn't work, or I am misunderstanding what it does.

Solution

You can supply the schema you want yourself when reading the JSON file, instead of relying on schema inference. It looks something like this:

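from pyspark.sql.types import StructType, StructField, StringType

# Define the schema explicitly; an empty StructType() keeps field_b as a struct
schema = StructType([
    StructField("field_a", StringType(), True),
    StructField("field_b", StructType(), True),
])

# multiLine is still needed because sample.json spans several lines
df = spark.read.schema(schema).option("multiLine", True).json("sample.json")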

df.printSchema()

output:

root
 |-- field_a: string (nullable = true)
 |-- field_b: struct (nullable = true)
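
If it is not obvious which fields need to be declared by hand, one option is to scan a sample document for empty objects before writing the schema. The following is only a rough sketch using the standard json module, not anything Spark provides: find_empty_objects is a hypothetical helper name, and the sample string simply mirrors sample.json from the question.

```python
import json


def find_empty_objects(value, path=""):
    """Return the dotted paths of every empty JSON object ({}) inside value."""
    if isinstance(value, dict):
        if not value:
            # An empty object: this is the kind of field missing from the inferred schema
            return [path]
        found = []
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            found.extend(find_empty_objects(child, child_path))
        return found
    if isinstance(value, list):
        # Array elements share the path of the array itself
        found = []
        for item in value:
            found.extend(find_empty_objects(item, path))
        return found
    return []


# Same content as sample.json in the question
sample = json.loads('{"field_a": "hello", "field_b": {}}')
empty_fields = find_empty_objects(sample)
print(empty_fields)  # ['field_b']
```

Each path it reports is a candidate for an explicit StructField(..., StructType(), True) entry in the schema above.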