Why are my empty structs missing when I'm reading from JSON?
sample.json:
{
"field_a": "hello",
"field_b": {}
}
read.py:
df = spark.read.options(multiline=True, dropFieldIfAllNull=False).json("sample.json")
df.printSchema()
output:
root
|-- field_a: string
expected output:
root
|-- field_a: string
|-- field_b: struct
I looked at the Spark docs (https://spark.apache.org/docs/latest/sql-data-sources-json.html), and they describe dropFieldIfAllNull as controlling whether columns of all null values or empty arrays/structs are ignored during schema inference. With it set to False I would expect the empty struct to be kept, but either it doesn't work that way or I'm misunderstanding what the option does.
You can enforce the schema you want by supplying it explicitly when reading the JSON file, instead of relying on inference. It looks something like this:
from pyspark.sql.types import StructType, StructField, StringType

# Define a custom schema; field_b is declared as an empty struct
schema = StructType([
    StructField("field_a", StringType(), True),
    StructField("field_b", StructType(), True)
])
df = spark.read.schema(schema).json("sample.json")
df.printSchema()
root
|-- field_a: string (nullable = true)
|-- field_b: struct (nullable = true)
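As for why inference drops the field in the first place: an empty JSON object parses to a value with no keys, so the inferencer has no field names or types to derive a struct from. A minimal pure-Python sketch (using the standard json module, not Spark) illustrates what the reader sees:

```python
import json

# Same shape as sample.json: field_b is an empty object
raw = '{"field_a": "hello", "field_b": {}}'
record = json.loads(raw)

# field_b is present in the parsed data, but it carries no keys,
# so there are no field names or types to infer a struct from
print(record["field_b"])        # -> {}
print(list(record["field_b"]))  # -> []
```

Declaring the schema yourself, as above, sidesteps inference entirely, which is why the empty struct survives.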