
Convert XML data to pandas dataframe via pyspark.sql.dataframe


My background: long-time SAS and R user, trying to figure out how to do some elementary things in Azure Databricks using Python and Spark. Sorry for the lack of a reproducible example below; I'm not sure how to create one like this.

I'm trying to read data from a complicated XML file. I've reached this point, where I have a pyspark.sql.dataframe (call it xml1) with this arrangement:

RESPONSE:array
  element:array
    element:struct
      VALUE:string
      VARNAME:string

The xml1 dataframe looks like this:

[Row(RESPONSE=[[Row(VALUE='No', VARNAME='PROV_U'), Row(VALUE='Included', VARNAME='ADJSAMP'), Row(VALUE='65', VARNAME='AGE'), ...

When I use xml2=xml1.toPandas(), I get this:

                      RESPONSE
0   [[(No, PROV_U), (Included, ADJSAMP), (65, AGE)...
1   [[(Included, ADJSAMP), (71, AGE), ...
...

At a minimum, I would like to convert this to a Pandas dataframe with two columns, VARNAME and VALUE. A better solution would be a dataframe whose columns are named with the VARNAME values (such as PROV_U, ADJSAMP, AGE), with one row per RESPONSE. Helpful hints that name the correct Python terms for the intermediate steps are appreciated!


Solution

  • To deal with an array of structs, explode is your answer. Each call to pyspark.sql.functions.explode flattens one array level, so with the nested schema above you explode twice to reach the VALUE/VARNAME structs, then select those two fields. For the wide layout, group by response and pivot on VARNAME. Here is a link on how to use explode: https://hadoopist.wordpress.com/2016/05/16/how-to-handle-nested-dataarray-of-structures-or-multiple-explodes-in-sparkscala-and-pyspark/
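If you have already called toPandas(), as in the question, the same flattening can be sketched in pure pandas instead of Spark. The sample values below are illustrative (taken from the printout in the question, not the real data), and RESPONSE_ID is a hypothetical column name introduced here to keep track of which original row each pair came from:

```python
import pandas as pd

# Simulated output of xml1.toPandas(): each RESPONSE cell is a list of
# lists of (VALUE, VARNAME) tuples, mirroring the printout above.
xml2 = pd.DataFrame({
    "RESPONSE": [
        [[("No", "PROV_U"), ("Included", "ADJSAMP"), ("65", "AGE")]],
        [[("Included", "ADJSAMP"), ("71", "AGE")]],
    ]
})

# Flatten both array levels, keeping the row index so each original
# RESPONSE stays one logical record.
records = [
    {"RESPONSE_ID": i, "VARNAME": varname, "VALUE": value}
    for i, outer in xml2["RESPONSE"].items()
    for inner in outer
    for value, varname in inner
]

# Long form: one row per (VARNAME, VALUE) pair.
long_df = pd.DataFrame(records)

# Wide form: one row per RESPONSE, one column per VARNAME value.
wide_df = long_df.pivot(index="RESPONSE_ID",
                        columns="VARNAME",
                        values="VALUE")

print(long_df)
print(wide_df)
```

Responses that lack a given VARNAME (here, PROV_U in the second row) simply come out as NaN in the wide form, which is usually what you want for survey-style data like this.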