Tags: python, dictionary, apache-spark, dataframe, rdd

PySpark RDD to dataframe with list of tuple and dictionary


I have processed some data in PySpark, and the result is an RDD with this structure:

[(u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}), (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}), (u'993', {'location': 'Mexico', 'Age': '12', 'Colour': 'Blue'}), (u'994', {'location': 'USA', 'Age': '24', 'Colour': 'Red'})]

How do I convert this structure into a DataFrame? My end goal is to store it as a Hive table with four columns: ID (e.g. 991), Location, Age, and Colour.

The Row-based solutions I have found do not seem to work, given that the dictionary is nested inside a tuple.


Solution

  • Convert each tuple to a Row object and then call the toDF method. Row(ID=t[0], **t[1]) passes the dictionary in the tuple as keyword arguments to the Row constructor, and ID=t[0] adds a new key-value pair with ID as the key:

    from pyspark.sql import Row
    rdd.map(lambda t: Row(ID=t[0], **t[1])).toDF().show()
    +---+------+---+---------+
    |Age|Colour| ID| location|
    +---+------+---+---------+
    | 27|  Pink|991|Australia|
    | 55| Black|993|Singapore|
    | 12|  Blue|993|   Mexico|
    | 24|   Red|994|      USA|
    +---+------+---+---------+
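To see why this works without a Spark cluster at hand, here is a minimal plain-Python sketch of the same keyword-unpacking step: dict(ID=t[0], **t[1]) merges the tuple's first element with the dictionary's entries exactly the way Row(ID=t[0], **t[1]) does. The to_record helper name is just for illustration; it is not part of the PySpark API.

```python
# Illustration (plain Python, no Spark needed) of the merge that
# Row(ID=t[0], **t[1]) performs on each element of the RDD.
data = [
    (u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}),
    (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}),
]

def to_record(t):
    # The dictionary's key/value pairs become keyword arguments,
    # alongside the new ID key taken from the tuple's first element.
    return dict(ID=t[0], **t[1])

records = [to_record(t) for t in data]
print(records[0])
# {'ID': '991', 'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}
```

Once you have the DataFrame, storing it as a Hive table should be a matter of something like df.write.saveAsTable("your_table") (table name hypothetical), assuming your SparkSession was created with Hive support enabled. Note that toDF infers the column order alphabetically from the Row field names, so select the columns explicitly if the Hive table needs a particular order.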