
Convert XML data to pandas dataframe via pyspark.sql.dataframe


My background: long-time SAS and R user, trying to figure out how to do some elementary things in Azure Databricks using Python and Spark. Sorry for the lack of a reproducible example below; I'm not sure how to create one like this.

I'm trying to read data from a complicated XML file. I've reached this point, where I have a pyspark.sql.dataframe (call it xml1) with this arrangement:

RESPONSE:array
  element:array
    element:struct
      VALUE:string
      VARNAME:string

The xml1 dataframe looks like this:

[Row(RESPONSE=[[Row(VALUE='No', VARNAME='PROV_U'), Row(VALUE='Included', VARNAME='ADJSAMP'), Row(VALUE='65', VARNAME='AGE'), ...

When I use xml2=xml1.toPandas(), I get this:

                      RESPONSE
0   [[(No, PROV_U), (Included, ADJSAMP), (65, AGE)...
1   [[(Included, ADJSAMP), (71, AGE), ...
...

At a minimum, I would like to convert this to a Pandas dataframe with two columns, VARNAME and VALUE. A better solution would be a dataframe whose columns are named with the VARNAME values (such as PROV_U, ADJSAMP, AGE), with one row per RESPONSE. Helpful hints that name the correct Python terms for the intermediate steps are appreciated!


Solution

  • To deal with an array of structs, explode is your answer. Each call to pyspark.sql.functions.explode flattens one array level, so with the nested schema above you explode twice to reach the VALUE/VARNAME structs, then select those two fields. For the wide layout, group by response and pivot on VARNAME. Here is a link on how to use explode: https://hadoopist.wordpress.com/2016/05/16/how-to-handle-nested-dataarray-of-structures-or-multiple-explodes-in-sparkscala-and-pyspark/
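If you have already called toPandas(), as in the question, the same flattening can be sketched in pure pandas instead of Spark. The sample values below are illustrative (taken from the printout in the question, not the real data), and RESPONSE_ID is a hypothetical column name introduced here to keep track of which original row each pair came from:

```python
import pandas as pd

# Simulated output of xml1.toPandas(): each RESPONSE cell is a list of
# lists of (VALUE, VARNAME) tuples, mirroring the printout above.
xml2 = pd.DataFrame({
    "RESPONSE": [
        [[("No", "PROV_U"), ("Included", "ADJSAMP"), ("65", "AGE")]],
        [[("Included", "ADJSAMP"), ("71", "AGE")]],
    ]
})

# Flatten both array levels, keeping the row index so each original
# RESPONSE stays one logical record.
records = [
    {"RESPONSE_ID": i, "VARNAME": varname, "VALUE": value}
    for i, outer in xml2["RESPONSE"].items()
    for inner in outer
    for value, varname in inner
]

# Long form: one row per (VARNAME, VALUE) pair.
long_df = pd.DataFrame(records)

# Wide form: one row per RESPONSE, one column per VARNAME value.
wide_df = long_df.pivot(index="RESPONSE_ID",
                        columns="VARNAME",
                        values="VALUE")

print(long_df)
print(wide_df)
```

Responses that lack a given VARNAME (here, PROV_U in the second row) simply come out as NaN in the wide form, which is usually what you want for survey-style data like this.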