Tags: rest, pyspark, azure-databricks

PySpark: query a REST API without knowing the schema


How can I define the PySpark struct types to reflect the following JSON result from the API? This is the complete result returned by the API call. Please note that the coordinates element has a dynamic length that is unknown before the JSON is obtained.

{'type': 'Polygon',
 'coordinates': [[[-74.53703811195342, 43.93214895162186],
   [-74.53765823498132, 43.932511606633376],
   [-74.53790321887529, 43.933052813967755],
...
   
   [-74.53653908240167, 43.93223696479821],
   [-74.53703811195342, 43.93214895162186]]]}

My attempt:

# set up the UDF return schema
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("type", StringType()),
    StructField("coordinates", StringType())  # I just put a random StringType here
])

# then, after the API query,
from pyspark.sql.functions import col

result_df = request_df \
    .withColumn("result", udf_executeRestApi(col("body")))
df = result_df.select("*")  # keep all columns, including the new result
df.show()

and got:

It's showing the Polygon object as expected, but it is formatted incorrectly, and the predefined type and coordinates fields are not recognised here.



Solution

  • You can use the below schema for your REST API response.

    from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

    # coordinates is a triple-nested array: polygon -> ring -> [longitude, latitude] pair
    schema = StructType([
        StructField("type", StringType()),
        StructField("coordinates", ArrayType(ArrayType(ArrayType(DoubleType()))))])

    df = spark.createDataFrame([json_data], schema=schema)
    display(df)
    

    Output: one row, with type as a string and coordinates as a nested array of doubles.
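
    As a quick illustration of working with the nested result, here is a minimal sketch (assuming the df built above) that explodes the outer ring into one row per coordinate pair; the column names longitude and latitude are my own labels, not anything returned by the API:

    from pyspark.sql.functions import explode, col

    # coordinates is an array of rings; each ring is an array of
    # [longitude, latitude] pairs, so explode twice to reach the pairs.
    points = (df
              .select(col("type"), explode(col("coordinates")).alias("ring"))
              .select(col("type"), explode(col("ring")).alias("point"))
              .select(col("type"),
                      col("point")[0].alias("longitude"),
                      col("point")[1].alias("latitude")))
    points.show(truncate=False)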

    Also, you need to check what your udf returns: give the above schema as the return type when registering the udf, and return your JSON data directly. If you provide your udf function, it will help me resolve this more clearly.
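
    For illustration, here is a minimal sketch of that registration; the request logic inside executeRestApi is a hypothetical placeholder (endpoint included), since the original udf was not posted:

    import json
    import requests
    from pyspark.sql.functions import udf, col

    # Hypothetical stand-in for the original udf; only the return value
    # matters here: a dict that Spark maps onto the struct schema.
    def executeRestApi(body):
        response = requests.post("https://example.com/api", data=body)  # assumed endpoint
        return json.loads(response.text)

    # Register the udf with the nested schema as its return type.
    udf_executeRestApi = udf(executeRestApi, schema)

    result_df = request_df.withColumn("result", udf_executeRestApi(col("body")))
    result_df.select("result.type", "result.coordinates").show(truncate=False)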