I want to define a schema for my structured streaming job (in Python), but I am not able to get the DataFrame schema the way I want.
For this JSON:
{
  "messages": [{
    "IdentityNumber": 1,
    "body": {
      "Alert": "This is the payload"
    },
    "regionNumber": 11000002
  }]
}
I am using the code below as the schema:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema1 = StructType([
    StructField("messages", ArrayType(
        StructType([
            StructField("body", StructType([
                StructField("Alert", StringType())
            ]))
        ]),
        True  # array elements may be null
    ))
])
But the schema I get is

df -> messages -> body -> Alert

while I want something like this:

df -> Alert

i.e., a DataFrame with a single column named Alert that contains all the alert strings present in the messages. What change should I make to my schema?
The schema is fine if the data you are reading respects it.
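For context, here is a minimal sketch of applying that schema in a structured streaming read. The Kafka source, servers, and topic name are assumptions, since the question does not say where the stream comes from; spark is an existing SparkSession:

from pyspark.sql.functions import col, from_json

# Hypothetical streaming source; substitute your own format and options.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed
    .option("subscribe", "alerts")                        # assumed topic
    .load())

# Parse each message's value as JSON using schema1 from the question.
df = (raw
    .select(from_json(col("value").cast("string"), schema1).alias("data"))
    .select("data.*"))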
If you need to extract nested fields after reading the JSON with the above schema, just use dot notation. Note that indexing into the array requires expr (col alone does not parse the [0]). For example:

df.select(expr("messages[0].body.Alert"))
If you need to explode all the array elements and operate on each one, this article explains the different options you have: https://docs.databricks.com/_static/notebooks/transform-complex-data-types-scala.html
The article is in Scala, but most Spark SQL APIs transfer easily to PySpark.
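As a minimal PySpark sketch of the explode approach (assuming df was read with schema1 as above), this yields the single Alert column the question asks for:

from pyspark.sql.functions import col, explode

# One row per element of the messages array, then pull the nested
# Alert string up to a top-level column.
alerts = (df
    .select(explode(col("messages")).alias("message"))
    .select(col("message.body.Alert").alias("Alert")))

alerts.printSchema()
# root
#  |-- Alert: string (nullable = true)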