I have processed some data in pyspark and it is an RDD that has this structure
[(u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}), (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}), (u'993', {'location': 'Mexico', 'Age': '12', 'Colour': 'Blue'}), (u'994', {'location': 'USA', 'Age': '24', 'Colour': 'Red'})]
How do I convert this structure into a DataFrame? My end goal is to store it as a Hive table with four columns: ID (e.g. 991), Location, Age, and Colour.
The Row solution does not seem to work, given that the dictionary is nested inside a tuple.
Convert each tuple to a Row object and then call the toDF method. With Row(ID=t[0], **t[1]), the dictionary in each tuple is unpacked and passed as keyword arguments to the Row, while ID=t[0] adds a new key-value pair with ID as the key:
from pyspark.sql import Row
rdd.map(lambda t: Row(ID=t[0], **t[1])).toDF().show()
+---+------+---+---------+
|Age|Colour| ID| location|
+---+------+---+---------+
| 27| Pink|991|Australia|
| 55| Black|993|Singapore|
| 12| Blue|993| Mexico|
| 24| Red|994| USA|
+---+------+---+---------+
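The Row(ID=t[0], **t[1]) trick is ordinary Python keyword unpacking, so it can be sketched without Spark at all, using plain dicts in place of Row objects (the sample records below mirror the RDD's elements):

```python
# Sample records shaped like the RDD's elements: (id, attribute_dict)
records = [
    (u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}),
    (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}),
]

# Merge the ID into each attribute dict, exactly as Row(ID=t[0], **t[1]) does:
# t[1] is unpacked into keyword arguments, and ID becomes one more field.
rows = [dict(ID=t[0], **t[1]) for t in records]

print(rows[0])
```

Note that in the show() output above the columns come out as Age, Colour, ID, location because Row sorts keyword-argument fields alphabetically (in Spark versions before 3.0). If the Hive table needs a specific column order, something like df.select("ID", "location", "Age", "Colour").write.saveAsTable("my_table") should work, assuming your SparkSession was built with Hive support enabled.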