Json response contains columns and rows seperately

My json response contains columns and rows seperately. How to parse following data in pyspark (mapping the columns with rows)

Response is as follows, Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"

Please help me how to parse it using PySpark.

Solution

Since you have your json as a string stored in Response i.e.,

Response =  "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"

Use json.loads() to create a dictionary from this Response(string).

import json

data_dict = json.loads(Response)
print(data_dict.keys())

# dict_keys(['columns', 'rows'])

Retrieve row data (data for dataframe) and column data (create schema for dataframe) as shown below:

rows = data_dict['rows'] #input data for creating dataframe
print(rows)
#[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]

cols = data_dict['columns']
l = [i for column in cols for i in column.items()]

#schema_str to create a schema (as a string) using response column data
schema_str = "StructType(["

convert = [] 
#list of columns that should be as DateTime. 
#First create dataframe as StringType column for these and then convert each column..
#in convert (list) to TimestampType.

for c in l:

    #column name
    col_name = c[0]   
    
    #column type
    if(c[1]['type'] =='Numeric'):
        col_type = 'IntegerType()'
    elif(c[1]['type'] == 'Text'):
        col_type = 'StringType()'
    elif(c[1]['type'] == 'DateTime'):
        #converting datetime type to StringType, to later convert to TimestampType
        col_type = 'StringType()'
        convert.append(col_name) #appending columns to be converted to a list
    
    #if column is nullable or not
    col_nullable = c[1]['nullable']
    
    
    schema_str = schema_str+f'StructField("{col_name}",{col_type},{col_nullable}),'
    
schema_str = schema_str[:-1] + '])'

print(schema_str)
#StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])

Now use row data and above schema(string) to create dataframe

from pyspark.sql.types import *
from pyspark.sql.functions import *

df = spark.createDataFrame(data=rows, schema=eval(schema_str))
df.show()
df.printSchema()

#output

+---+----+-----------+--------------------+
| id|name|description|        last_updated|
+---+----+-----------+--------------------+
|  1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
|  4| bar|       null|2013-07-01T13:04:24Z|
+---+----+-----------+--------------------+

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- description: string (nullable = true)
 |-- last_updated: string (nullable = false)

Convert the columns that need to be type casted to TimestampType()

for to_be_converted in convert:
    df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
df.show()
df.printSchema()

#output

+---+----+-----------+-------------------+
| id|name|description|       last_updated|
+---+----+-----------+-------------------+
|  1| foo|Lorem ipsum|2016-10-26 00:09:14|
|  4| bar|       null|2013-07-01 13:04:24|
+---+----+-----------+-------------------+


root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = false)
 |-- description: string (nullable = true)
 |-- last_updated: timestamp (nullable = true)