My background: long-time SAS and R user, trying to figure out how to do some elementary things in Azure Databricks using Python and Spark. Sorry for the lack of a reproducible example below; I'm not sure how to create one like this.
I'm trying to read data from a complicated XML file. I've reached the point where I have a pyspark.sql.DataFrame (call it xml1) with this schema:
RESPONSE: array
    element: array
        element: struct
            VALUE: string
            VARNAME: string
The xml1 dataframe looks like this:
[Row(RESPONSE=[[Row(VALUE='No', VARNAME='PROV_U'), Row(VALUE='Included', VARNAME='ADJSAMP'), Row(VALUE='65', VARNAME='AGE'), ...
When I use xml2 = xml1.toPandas(), I get this:
RESPONSE
0 [[(No, PROV_U), (Included, ADJSAMP), (65, AGE)...
1 [[(Included, ADJSAMP), (71, AGE), ...
...
At a minimum, I would like to convert this to a Pandas dataframe with two columns, VARNAME and VALUE. Better still would be a dataframe with one column per VARNAME value (such as PROV_U, ADJSAMP, AGE) and one row per RESPONSE. Helpful hints, including the names of the correct Python terms for the intermediate steps, are appreciated!
To deal with an array of structs, explode is your answer. Here is a link on how to use explode: https://hadoopist.wordpress.com/2016/05/16/how-to-handle-nested-dataarray-of-structures-or-multiple-explodes-in-sparkscala-and-pyspark/
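Because RESPONSE is an array of arrays of structs, you need to explode twice: once for the outer array and once for the inner one. Below is a minimal sketch, assuming xml1 has exactly the schema shown in the question; the response_id column is introduced only for illustration, to keep track of which original row each struct came from so the pivoted result has one row per RESPONSE.

    from pyspark.sql import functions as F

    # Tag each original row so the pivoted result keeps one row per RESPONSE.
    # monotonically_increasing_id() is just a surrogate key for illustration;
    # use a real identifier column if your data has one.
    xml_long = (
        xml1
        .withColumn("response_id", F.monotonically_increasing_id())
        .withColumn("inner", F.explode("RESPONSE"))   # unnest the outer array
        .withColumn("item", F.explode("inner"))       # unnest the inner array of structs
        .select(
            "response_id",
            F.col("item.VARNAME").alias("VARNAME"),
            F.col("item.VALUE").alias("VALUE"),
        )
    )

    # "At a minimum": a long table with two columns, VARNAME and VALUE
    xml_long.toPandas()

    # "Better": one column per VARNAME, one row per RESPONSE
    xml_wide = (
        xml_long
        .groupBy("response_id")
        .pivot("VARNAME")
        .agg(F.first("VALUE"))
    )
    xml_wide.toPandas()

The pivot/first combination assumes each VARNAME appears at most once per RESPONSE; if a name can repeat, pick an aggregation that fits your data (for example F.collect_list).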