Tags: python, apache-spark, pyspark

How to convert a dictionary to dataframe in PySpark?


I am trying to convert a dictionary: data_dict = {'t1': '1', 't2': '2', 't3': '3'} into a dataframe:

+---+-----+
|key|value|
+---+-----+
| t1|    1|
| t2|    2|
| t3|    3|
+---+-----+

To do that, I tried:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("key", StringType(), True), StructField("value", StringType(), True)])
ddf = spark.createDataFrame(data_dict, schema)

But I got the below error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/session.py", line 413, in _createFromLocal
    data = list(data)
  File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/session.py", line 730, in prepare
    verify_func(obj)
  File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/types.py", line 1389, in verify
    verify_value(obj)
  File "/usr/local/Cellar/apache-spark/2.4.5/libexec/python/pyspark/sql/types.py", line 1377, in verify_struct
    % (obj, type(obj))))
TypeError: StructType can not accept object 't1' in type <class 'str'>

So I tried it without specifying a full schema, passing just the column datatypes:

ddf = spark.createDataFrame(data_dict, StringType())
ddf = spark.createDataFrame(data_dict, StringType(), StringType())

But both result in a DataFrame with a single column that contains only the dictionary keys:

+-----+
|value|
+-----+
|t1   |
|t2   |
|t3   |
+-----+

Could anyone let me know how to convert a dictionary into a Spark DataFrame in PySpark?


Solution

  • You can use data_dict.items() to list key/value pairs:

    spark.createDataFrame(data_dict.items()).show()
    

    Which prints

    +---+---+
    | _1| _2|
    +---+---+
    | t1|  1|
    | t2|  2|
    | t3|  3|
    +---+---+
    

    Of course, you can specify your schema:

    from pyspark.sql.types import StructType, StructField, StringType

    spark.createDataFrame(data_dict.items(), 
                          schema=StructType(fields=[
                              StructField("key", StringType()), 
                              StructField("value", StringType())])).show()
    

    Resulting in

    +---+-----+
    |key|value|
    +---+-----+
    | t1|    1|
    | t2|    2|
    | t3|    3|
    +---+-----+