I'm consuming a REST API in Databricks using PySpark. The API response returns a list where each element of the list is a JSON string. When I parallelize the response and read it, Spark yields a `_corrupt_record` column in which each value is a JSON string:
### API Call

```python
import requests

response = requests.get(api_url, headers=api_call_header)
api_json = response.json()
df = spark.read.json(sc.parallelize(api_json))
display(df)
```
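A quick way to see what the call actually returns (a sketch using the variables above) is to print the element types before handing anything to Spark:

```python
# Hypothetical sanity check: what did requests actually return?
print(type(api_json))       # dict or list, not a JSON string
print(type(api_json[0]))    # if it's a list: str (JSON text) or an already-parsed dict?
```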
This is what the JSON string of a single value looks like when I copy it into a JSON validator:
```
{
  'Var1': 'String',
  'Var2': {
    'Var3': 'String',
    'Var4': None,
    'Var5': 'String',
    'Var6': 'String',
    'Var7': 'String',
    'Var8': 'String'
  },
  'Var9': None,
  'Var10': 'String'
}
```
For whatever reason, I can't access the nested Struct objects of Var2. When I use the from_json function and the following from-scratch schema, it yields NULL values from Var2 onward:
```python
from pyspark.sql.types import StructType, StructField, StringType, NullType

schema = StructType([
    StructField('Var1', StringType()),
    StructField('Var2', StructType([
        StructField('Var3', StringType()),
        StructField('Var4', NullType()),
        StructField('Var5', StringType()),
        StructField('Var6', StringType()),
        StructField('Var7', StringType()),
        StructField('Var8', StringType())
    ])),
    StructField('Var9', NullType()),
    StructField('Var10', StringType())
])
```
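As an aside, `from_json` also accepts the schema as a DDL-formatted string; an equivalent sketch (I've relaxed the always-null fields to STRING here, which is my assumption, not part of the original schema):

```python
# Same schema expressed as a DDL string; NullType fields relaxed to STRING.
ddl_schema = ("Var1 STRING, "
              "Var2 STRUCT<Var3: STRING, Var4: STRING, Var5: STRING, "
              "Var6: STRING, Var7: STRING, Var8: STRING>, "
              "Var9 STRING, Var10 STRING")
```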
This is my code attempting to parse the JSON string:
```python
from pyspark.sql.functions import col, from_json

df = df.withColumn('struct_json', from_json(col('_corrupt_record'), schema))
```
That parses the first key:value pair but yields the rest of the column values as NULL:
*object:*

```
Var1: "String"
Var2: NULL
Var3: NULL
Var4: NULL
Var5: NULL
Var6: NULL
Var7: NULL
Var8: NULL
Var9: NULL
Var10: NULL
```
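To see why `from_json` gives up, it helps to look at the raw record text itself; a minimal sketch:

```python
# Inspect the raw text Spark failed to parse; truncate=False shows
# the whole record, quotes and all.
df.select('_corrupt_record').show(1, truncate=False)
```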
Any help would be much appreciated!
### Attempted Solutions

- **JSON schema from scratch.** As mentioned above, it yields NULL values.
- **multiLine=True and allowSingleQuotes=True read options.** Found these in another StackOverflow post, but they still yielded NULL values with my from-scratch schema (see the sketch after this list).
- **Schema derived with rdd.map.** I tried to infer a schema using `json_schema = spark.read.json(df.rdd.map(lambda row: row._corrupt_record)).schema`, but that simply created a one-layer struct whose single field held the entire JSON string, with no nested objects parsed out.
- **SQL to parse key:value pairs.** Too many nested objects and arrays to parse successfully, and performance was poor.
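For reference, the second attempt looked roughly like this (reconstructed from the description above, so treat it as a sketch):

```python
# Read options from the second bullet; this still produced NULLs
# for the nested fields with the from-scratch schema.
df = (spark.read
      .option('multiLine', True)
      .option('allowSingleQuotes', True)
      .json(sc.parallelize(api_json)))
```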
### Answer

The answer to this was embarrassingly simple:
From the API call, `api_json = response.json()` does not return JSON text: it deserializes the response body into native Python objects (dictionaries, in this case). This was confirmed by checking `type(api_json)`. Creating a DataFrame with the `spark.read.json` method was therefore incorrect, since the source `api_json` data was already-parsed Python data, not JSON.
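(One alternative would have kept `spark.read.json` by re-serializing each element into a real JSON string first; a minimal sketch, assuming `api_json` is a list of dictionaries. The actual fix below is simpler.)

```python
import json

# Re-serialize each record so Spark receives valid JSON text
# (double quotes, null) instead of Python's repr of a dict.
df = spark.read.json(sc.parallelize([json.dumps(rec) for rec in api_json]))
```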
So the fix was changing this:

```python
response = requests.get(api_url, headers=api_call_header)
api_json = response.json()
df = spark.read.json(sc.parallelize(api_json))
display(df)
```
To this:

```python
response = requests.get(api_url, headers=api_call_header)
api_json = response.json()
df = spark.createDataFrame(api_json, schema=schema)
display(df)
```
For the schema, I used the one I had built from scratch in PySpark.
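To confirm the nested struct is now addressable, a quick check; the field names come from the schema above:

```python
# Once parsed with the struct schema, nested fields can be selected
# with dot notation.
df.select('Var1', 'Var2.Var3', 'Var2.Var5').show(truncate=False)
```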