I have a list of .json files that contain person information, one file per person. I want to load this data into a table using PySpark in an Azure Databricks notebook.
Let's say the files are structured like this:
{
    "id": 1,
    "name": "Homer",
    "address": {
        "street": "742 Evergreen Terrace",
        "city": "Springfield"
    }
}
Fairly simple JSON, which I can read into a DataFrame with this code:
from pyspark.sql.functions import col

sourcejson = spark.read.json("path/to/json")

df = (
    sourcejson.select(
        col('id'),
        col('name'),
        col('address.street').alias('street'),
        col('address.city').alias('city')
    )
)
which gives the expected result:

id | name  | street                | city
1  | Homer | 742 Evergreen Terrace | Springfield
However, the problem starts when the address is unknown. In that case, the whole address struct in the JSON is just null:
{
    "id": 2,
    "name": "Ned",
    "address": null
}
In the example file above, we don't know Ned's address, so we have a null. Using the code from before, I would expect a result like this:

id | name | street | city
2  | Ned  | null   | null
However, running the code results in an error:
[INVALID_EXTRACT_BASE_FIELD_TYPE] Can't extract a value from "address". Need a complex type [STRUCT, ARRAY, MAP] but got "STRING"
I understand the reason behind the error, but I can't find a solution for it. Any ideas on how we could handle this?
When you don't provide a schema for spark.read.json, Spark infers one from the data. When the address is null in every object, there is nothing to infer a struct from, so Spark falls back to StringType for that field
Extracting address.street from that string column is why you are getting the error. One possible solution is to read the data with an explicit schema:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

raw_data = spark.sparkContext.parallelize(
    ['{"id": 2, "name": "Marge", "address": null}']
)

address_struct = StructType([
    StructField('street', StringType(), True),
    StructField('city', StringType(), True),
])

schema = StructType([
    StructField('id', LongType(), True),
    StructField('name', StringType(), True),
    StructField('address', address_struct, True),
])

sourcejson = spark.read.json(raw_data, schema=schema)
res = (
    sourcejson.select(
        F.col('id'),
        F.col('name'),
        F.col('address.street').alias('street'),
        F.col('address.city').alias('city')
    )
)

res.show(10, False)
# +---+-----+------+----+
# |id |name |street|city|
# +---+-----+------+----+
# |2 |Marge|NULL |NULL|
# +---+-----+------+----+