I want to define a schema for my structured streaming job (in Python), but I am not able to get the DataFrame schema the way I want.
For this JSON:
{
  "messages": [{
    "IdentityNumber": 1,
    "body": {
      "Alert": "This is the payload"
    },
    "regionNumber": 11000002
  }]
}
I am using the code below as the schema:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema1 = StructType([
    StructField("messages", ArrayType(
        StructType([
            StructField("body", StructType([
                StructField("Alert", StringType())
            ]))
        ]),
        True  # array elements may be null
    ))
])
But the schema I get is

df -> messages -> body -> Alert

while I want something like this:

df -> Alert

i.e., a DataFrame with a single column named Alert that contains all the alert strings present in the messages. What change should I make to my schema?
The schema is fine if the data you are reading respects it.
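For context, here is a minimal sketch of applying that schema in a structured streaming read. The Kafka source, servers, and topic name are assumptions, since the question does not say where the stream comes from; spark is an existing SparkSession:

from pyspark.sql.functions import col, from_json

# Hypothetical streaming source; substitute your own format and options.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed
    .option("subscribe", "alerts")                        # assumed topic
    .load())

# Parse each message's value as JSON using schema1 from the question.
df = (raw
    .select(from_json(col("value").cast("string"), schema1).alias("data"))
    .select("data.*"))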
If you need to extract nested fields after reading the JSON with the above schema, just use dot notation. Note that indexing into the array requires expr (col alone does not parse the [0]). For example:

df.select(expr("messages[0].body.Alert"))
If you need to explode all the array elements and operate on each one, this article explains the different options you have: https://docs.databricks.com/_static/notebooks/transform-complex-data-types-scala.html
The article is in Scala, but most Spark SQL APIs transfer easily to PySpark.
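As a minimal PySpark sketch of the explode approach (assuming df was read with schema1 as above), this yields the single Alert column the question asks for:

from pyspark.sql.functions import col, explode

# One row per element of the messages array, then pull the nested
# Alert string up to a top-level column.
alerts = (df
    .select(explode(col("messages")).alias("message"))
    .select(col("message.body.Alert").alias("Alert")))

alerts.printSchema()
# root
#  |-- Alert: string (nullable = true)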