I have a list of .json files that contain person information, one file per person. I want to load this data into a table using PySpark in an Azure Databricks notebook.
Let's say the files are structured like this:
{
    "id": 1,
    "name": "Homer",
    "address": {
        "street": "742 Evergreen Terrace",
        "city": "Springfield"
    }
}
Fairly simple JSON, which I can read into a DataFrame with this code:
from pyspark.sql.functions import col

sourcejson = spark.read.json("path/to/json")

df = (
    sourcejson.select(
        col('id'),
        col('name'),
        col('address.street').alias('street'),
        col('address.city').alias('city')
    )
)
which gives the expected result:

id | name  | street                | city
1  | Homer | 742 Evergreen Terrace | Springfield
However, the problem starts when the address is unknown. In that case, the whole address struct in the JSON is just null:
{
    "id": 2,
    "name": "Ned",
    "address": null
}
In the example file above, we don't know Ned's address, so we have a null. Using the code from before, I would expect a result like this:

id | name | street | city
2  | Ned  | null   | null
However, running the code results in an error:
[INVALID_EXTRACT_BASE_FIELD_TYPE] Can't extract a value from "address". Need a complex type [STRUCT, ARRAY, MAP] but got "STRING"
I understand the reason behind the error, but I can't find a solution for it. Any ideas on how we could handle this?
When you don't provide a schema for spark.read.json, Spark infers one from the data. When the address is null in every object, there is nothing to infer a struct from, so Spark falls back to StringType for that field
Extracting address.street from that string column is why you are getting the error. One possible solution is to read the data with an explicit schema:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

raw_data = spark.sparkContext.parallelize(
    ['{"id": 2, "name": "Marge", "address": null}']
)

address_struct = StructType([
    StructField('street', StringType(), True),
    StructField('city', StringType(), True),
])

schema = StructType([
    StructField('id', LongType(), True),
    StructField('name', StringType(), True),
    StructField('address', address_struct, True),
])

sourcejson = spark.read.json(raw_data, schema=schema)
res = (
    sourcejson.select(
        F.col('id'),
        F.col('name'),
        F.col('address.street').alias('street'),
        F.col('address.city').alias('city')
    )
)

res.show(10, False)
# +---+-----+------+----+
# |id |name |street|city|
# +---+-----+------+----+
# |2 |Marge|NULL |NULL|
# +---+-----+------+----+