
_corrupt_record Column in Databricks Yields NULL Values When Using JSON Schema (PySpark)


I'm consuming a REST API in Databricks using PySpark. The API response is a list where each element looks like a JSON string. When I parallelize the list and read it with spark.read.json, I get a single _corrupt_record column whose values are those JSON strings:

### API Call
import requests

response = requests.get(api_url, headers=api_call_header)
api_json = response.json()
df = spark.read.json(sc.parallelize(api_json))
display(df)
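
For context, here is a stripped-down sketch with inline placeholder records standing in for the real API payload; as far as I can tell, it reproduces the same single _corrupt_record column:

# Hypothetical stand-in for response.json()
sample = [{'Var1': 'String', 'Var9': None}]

df_test = spark.read.json(sc.parallelize(sample))
display(df_test)  # one _corrupt_record column instead of parsed fields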

This is what the string for a single element looks like when I copy it into a JSON validator:

{
    'Var1': 'String',
    'Var2': {
        'Var3': 'String',
        'Var4': None,
        'Var5': 'String',
        'Var6': 'String',
        'Var7': 'String',
        'Var8': 'String'
    },
    'Var9': None,
    'Var10': 'String'
}

For whatever reason, I can't access the nested struct fields of Var2. When I use the from_json function with the following from-scratch schema, it yields NULL values from Var2 onward:

from pyspark.sql.types import StructType, StructField, StringType, NullType

schema = StructType([
    StructField('Var1', StringType()),
    StructField('Var2', StructType([
        StructField('Var3', StringType()),
        StructField('Var4', NullType()),
        StructField('Var5', StringType()),
        StructField('Var6', StringType()),
        StructField('Var7', StringType()),
        StructField('Var8', StringType())
    ])),
    StructField('Var9', NullType()),
    StructField('Var10', StringType())
])
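
As a sanity check, the schema itself parses a hand-written, double-quoted JSON literal just fine (a sketch with made-up values, using from_json and lit):

from pyspark.sql.functions import from_json, lit

valid = ('{"Var1": "A", "Var2": {"Var3": "B", "Var4": null, "Var5": "C", '
         '"Var6": "D", "Var7": "E", "Var8": "F"}, "Var9": null, "Var10": "G"}')

# Parse the literal against the schema and pull out a nested field
check = spark.range(1).select(from_json(lit(valid), schema).alias('parsed'))
check.select('parsed.Var1', 'parsed.Var2.Var3').show()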

This is my code attempting to parse the JSON string:

from pyspark.sql.functions import from_json, col

df = df.withColumn('struct_json', from_json(col('_corrupt_record'), schema))

That parses the first key:value pair but yields the rest of the column values as NULL:

*object:*
    Var1: "String"
    Var2: NULL
    Var3: NULL
    Var4: NULL
    Var5: NULL
    Var6: NULL
    Var7: NULL
    Var8: NULL
    Var9: NULL
    Var10: NULL
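
For reference, this is how I expect to pull the nested fields out of the struct column once the parse works (dot notation; names as defined above):

from pyspark.sql.functions import col

df.select(
    col('struct_json.Var1'),
    col('struct_json.Var2.Var3')   # nested field under Var2
).show()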

Any help would be much appreciated!

Attempted Solutions:

  1. JSON Schema from Scratch - As mentioned above, it yields NULL values.

  2. multiLine=True and allowSingleQuotes=True Read Options - Found this in another Stack Overflow post, but it still yielded NULL values with my from-scratch JSON schema.

  3. JSON Schema Using rdd.map Method - I tried to derive a schema with json_schema = spark.read.json(df.rdd.map(lambda row: row._corrupt_record)).schema, but that simply produced a one-layer struct whose single field held the entire JSON string, without any nested objects parsed out (see the sketch after this list).

  4. SQL to Parse Key:Value Pairs - Too many nested objects and arrays to parse reliably, and performance was poor.
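
For reference (attempt 3 above), schema inference over an RDD does work when the elements are valid JSON strings; a sketch with made-up, properly double-quoted records:

import json

records = [{'Var1': 'A', 'Var2': {'Var3': 'B'}, 'Var9': None}]
json_rdd = sc.parallelize([json.dumps(r) for r in records])

inferred = spark.read.json(json_rdd)
inferred.printSchema()  # Var2 is inferred as a nested struct here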


Solution

  • The answer to this was embarrassingly simple:

    From the API call, api_json = response.json() already returns parsed Python objects (dictionaries), not JSON text. This was confirmed by running type(api_json).

    Creating a DataFrame with spark.read.json was incorrect, since the source api_json data was already a parsed Python dictionary, not a JSON string. That also explains the single quotes and None in the data above: they are Python repr, not JSON.
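
    A quick way to see the difference in plain Python (made-up record):

    import json

    record = {'Var1': 'String', 'Var9': None}

    print(str(record))         # {'Var1': 'String', 'Var9': None}  <- Python repr, not JSON
    print(json.dumps(record))  # {"Var1": "String", "Var9": null}  <- valid JSON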

    So the fix was changing this:

    response = requests.get(api_url, headers=api_call_header)
    api_json = response.json()
    df = spark.read.json(sc.parallelize(api_json))
    display(df)
    

    To this:

    response = requests.get(api_url, headers=api_call_header)
    api_json = response.json()
    df = spark.createDataFrame(api_json, schema=schema)
    display(df)
    

    For the schema, I used the one I had built from scratch in PySpark.
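
    Putting it together, a self-contained sketch of the working path, with sample records standing in for the live API response:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, NullType

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for response.json(): already-parsed Python dicts
    api_json = [
        {'Var1': 'A',
         'Var2': {'Var3': 'B', 'Var4': None, 'Var5': 'C',
                  'Var6': 'D', 'Var7': 'E', 'Var8': 'F'},
         'Var9': None,
         'Var10': 'G'}
    ]

    schema = StructType([
        StructField('Var1', StringType()),
        StructField('Var2', StructType([
            StructField('Var3', StringType()),
            StructField('Var4', NullType()),
            StructField('Var5', StringType()),
            StructField('Var6', StringType()),
            StructField('Var7', StringType()),
            StructField('Var8', StringType())
        ])),
        StructField('Var9', NullType()),
        StructField('Var10', StringType())
    ])

    df = spark.createDataFrame(api_json, schema=schema)
    df.select('Var1', 'Var2.Var3', 'Var10').show()  # nested fields now accessible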