How can I format the PySpark struct types to reflect the following API JSON result? This is the complete result returned by the API call. Please note that the coordinates element has a dynamic length that is not known prior to obtaining the JSON.
{'type': 'Polygon',
'coordinates': [[[-74.53703811195342, 43.93214895162186],
[-74.53765823498132, 43.932511606633376],
[-74.53790321887529, 43.933052813967755],
...
[-74.53653908240167, 43.93223696479821],
[-74.53703811195342, 43.93214895162186]]]}
My attempt:
# set up the UDF return schema
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("type", StringType()),
    StructField("coordinates", StringType())  # I just put a random StringType as a placeholder
])
# then after the API query,
result_df = request_df \
    .withColumn("result", udf_executeRestApi(col("body")))
df = result_df.select([c for c in result_df.columns])
df.show()
and got:
It's showing the Polygon object as expected, but formatted wrong, and the predefined type and coordinates fields are not recognised here.
You can use the below schema for your REST API response.
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

schema = StructType([
    StructField("type", StringType()),
    StructField("coordinates", ArrayType(ArrayType(ArrayType(DoubleType()))))
])
df = spark.createDataFrame([json_data], schema=schema)  # json_data is the dict returned by the API
display(df)
Output:
Also, you need to check what your UDF returns. Use the above schema as the return type while registering the UDF and return your JSON data directly. If you provide your UDF function, it will help me resolve this more clearly.
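For example, here is a minimal sketch of how the registration could look, assuming your executeRestApi function calls the API and returns the parsed JSON dict (the request URL and function body below are placeholders, not your actual code):
import requests
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

schema = StructType([
    StructField("type", StringType()),
    StructField("coordinates", ArrayType(ArrayType(ArrayType(DoubleType()))))
])

def executeRestApi(body):
    # hypothetical API call -- replace with your real request logic
    response = requests.post("https://example.com/api", data=body)
    # return the parsed dict directly; Spark maps its keys onto the schema fields
    return response.json()

# register the UDF with the nested schema as its return type
udf_executeRestApi = udf(executeRestApi, schema)

result_df = request_df.withColumn("result", udf_executeRestApi(col("body")))
result_df.select("result.type", "result.coordinates").show(truncate=False)
Because the coordinates field is an ArrayType, it can hold however many points the API returns, so the unknown length is not a problem.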