Tags: python, dictionary, apache-spark, dataframe, rdd

PySpark RDD to dataframe with list of tuple and dictionary


I have processed some data in PySpark, and the result is an RDD with this structure:

[(u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}), (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}), (u'993', {'location': 'Mexico', 'Age': '12', 'Colour': 'Blue'}), (u'994', {'location': 'USA', 'Age': '24', 'Colour': 'Red'})]

How do I convert this structure into a DataFrame? My end goal is to store it as a Hive table with four columns: ID (e.g. 991), Location, Age, and Colour.

The Row-based solutions I have found do not seem to work, given that the dictionary is nested inside a tuple.


Solution

  • Convert each tuple to a Row object and then call the toDF method. Row(ID=t[0], **t[1]) passes the dictionary in the tuple as keyword arguments to the Row constructor, and ID=t[0] adds a new key-value pair with ID as the key:

    from pyspark.sql import Row
    rdd.map(lambda t: Row(ID=t[0], **t[1])).toDF().show()
    +---+------+---+---------+
    |Age|Colour| ID| location|
    +---+------+---+---------+
    | 27|  Pink|991|Australia|
    | 55| Black|993|Singapore|
    | 12|  Blue|993|   Mexico|
    | 24|   Red|994|      USA|
    +---+------+---+---------+
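To see why this works without a Spark cluster at hand, here is a minimal plain-Python sketch of the same keyword-unpacking step: dict(ID=t[0], **t[1]) merges the tuple's first element with the dictionary's entries exactly the way Row(ID=t[0], **t[1]) does. The to_record helper name is just for illustration; it is not part of the PySpark API.

```python
# Illustration (plain Python, no Spark needed) of the merge that
# Row(ID=t[0], **t[1]) performs on each element of the RDD.
data = [
    (u'991', {'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}),
    (u'993', {'location': 'Singapore', 'Age': '55', 'Colour': 'Black'}),
]

def to_record(t):
    # The dictionary's key/value pairs become keyword arguments,
    # alongside the new ID key taken from the tuple's first element.
    return dict(ID=t[0], **t[1])

records = [to_record(t) for t in data]
print(records[0])
# {'ID': '991', 'location': 'Australia', 'Age': '27', 'Colour': 'Pink'}
```

Once you have the DataFrame, storing it as a Hive table should be a matter of something like df.write.saveAsTable("your_table") (table name hypothetical), assuming your SparkSession was created with Hive support enabled. Note that toDF infers the column order alphabetically from the Row field names, so select the columns explicitly if the Hive table needs a particular order.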