Search code examples
scalaapache-spark

How to populate default value for a missing key in Json in Scala Dataframe?


I am reading json data in my scala profile. I am trying to pull customerid,location,city,state and status(status should be inside the address key). Since status is an optional key, it might not come in the json data for all record. In that case when i try to refer to get status it fails stating schema error. How can i read the a key even if it is not present in the json with some default value?

status key is supposed to be inside "address"
    [{
            "customerid": 123,
            "location": "NA",
            "address": {
                "city": "seattle",
                "state": "washington"
            }
        },
        {
            "customerid": 124,
            "location": "NA",
            "address": {
                "city": "seattle",
                "state": "washington"
            }
        }
    ]

output
customeid,location,city,state,status

Solution

  • Check below code.

    scala> df.show(false)
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |input                                                                                                                                                                                                                                                                                                                                                                                                            |
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[{\n            "customerid": 123,\n            "location": "NA",\n            "address": {\n                "city": "seattle",\n                "state": "washington"\n            }\n        },\n        {\n            "customerid": 124,\n            "location": "NA",\n            "address": {\n                "city": "seattle",\n                "state": "washington"\n            }\n        }\n    ]|
    +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    scala> val columns = Seq(
       "customerid", 
       "location", 
       "address['city'] AS city", 
       "address['state'] AS state", 
       "IFNULL(address['status'], 'N/A') AS status"
    ) // 'N/A' is default value for status, you can change as per your need.
    
    val colExprs = """named_struct(
        'customerid',
        in['customerid'],
        'location',
        in['location'],
        'address',
        from_json(in['address'], 'map<string, string>')
    )"""
    
    val jsonExprs = "from_json(input, 'array<map<string,string>>')"
    
    df
    .withColumn("input", expr(s"""transform(${jsonExprs}, in -> ${colExprs})"""))
    .selectExpr("inline(input)")
    .selectExpr(columns:_*)
    .show(false)
    
    +----------+--------+-------+----------+------+
    |customerid|location|city   |state     |status|
    +----------+--------+-------+----------+------+
    |123       |NA      |seattle|washington|NULL  |
    |124       |NA      |seattle|washington|NULL  |
    +----------+--------+-------+----------+------+