Search code examples
pysparkazure-databricks

Json response contains columns and rows seperately


My json response contains columns and rows seperately. How to parse following data in pyspark (mapping the columns with rows)

Response is as follows, Response = "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"

Please help me how to parse it using PySpark.


Solution

  • Since you have your json as a string stored in Response i.e.,

    Response =  "{"columns": [{ "id": { "type": "Numeric", "nullable": false } },{ "name": { "type": "Text", "nullable": false } },{ "description": { "type": "Text", "nullable": true } },{ "last_updated": { "type": "DateTime", "nullable": false } }], "rows": [[1, "foo", "Lorem ipsum", "2016-10-26T00:09:14Z"],[4, "bar", null, "2013-07-01T13:04:24Z"]]}"
    
    • Use json.loads() to create a dictionary from this Response(string).
    import json
    
    data_dict = json.loads(Response)
    print(data_dict.keys())
    
    # dict_keys(['columns', 'rows'])
    
    • Retrieve row data (data for dataframe) and column data (create schema for dataframe) as shown below:
    rows = data_dict['rows'] #input data for creating dataframe
    print(rows)
    #[[1, 'foo', 'Lorem ipsum', '2016-10-26T00:09:14Z'], [4, 'bar', None, '2013-07-01T13:04:24Z']]
    
    cols = data_dict['columns']
    l = [i for column in cols for i in column.items()]
    
    #schema_str to create a schema (as a string) using response column data
    schema_str = "StructType(["
    
    convert = [] 
    #list of columns that should be as DateTime. 
    #First create dataframe as StringType column for these and then convert each column..
    #in convert (list) to TimestampType.
    
    for c in l:
    
        #column name
        col_name = c[0]   
        
        #column type
        if(c[1]['type'] =='Numeric'):
            col_type = 'IntegerType()'
        elif(c[1]['type'] == 'Text'):
            col_type = 'StringType()'
        elif(c[1]['type'] == 'DateTime'):
            #converting datetime type to StringType, to later convert to TimestampType
            col_type = 'StringType()'
            convert.append(col_name) #appending columns to be converted to a list
        
        #if column is nullable or not
        col_nullable = c[1]['nullable']
        
        
        schema_str = schema_str+f'StructField("{col_name}",{col_type},{col_nullable}),'
        
    schema_str = schema_str[:-1] + '])'
    
    print(schema_str)
    #StructType([StructField("id",IntegerType(),False),StructField("name",StringType(),False),StructField("description",StringType(),True),StructField("last_updated",StringType(),False)])
    
    • Now use row data and above schema(string) to create dataframe
    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    
    df = spark.createDataFrame(data=rows, schema=eval(schema_str))
    df.show()
    df.printSchema()
    
    #output
    
    +---+----+-----------+--------------------+
    | id|name|description|        last_updated|
    +---+----+-----------+--------------------+
    |  1| foo|Lorem ipsum|2016-10-26T00:09:14Z|
    |  4| bar|       null|2013-07-01T13:04:24Z|
    +---+----+-----------+--------------------+
    
    root
     |-- id: integer (nullable = false)
     |-- name: string (nullable = false)
     |-- description: string (nullable = true)
     |-- last_updated: string (nullable = false)
    
    • Convert the columns that need to be type casted to TimestampType()
    for to_be_converted in convert:
        df = df.withColumn(to_be_converted, to_timestamp(to_be_converted).cast(TimestampType()))
    df.show()
    df.printSchema()
    
    #output
    
    +---+----+-----------+-------------------+
    | id|name|description|       last_updated|
    +---+----+-----------+-------------------+
    |  1| foo|Lorem ipsum|2016-10-26 00:09:14|
    |  4| bar|       null|2013-07-01 13:04:24|
    +---+----+-----------+-------------------+
    
    
    root
     |-- id: integer (nullable = false)
     |-- name: string (nullable = false)
     |-- description: string (nullable = true)
     |-- last_updated: timestamp (nullable = true)