
How could a PySpark RDD linear list be converted to a DataFrame?


I'd like to convert a linear list to a dataframe. i.e. given the following list,

a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

Expected result is,

+--------------------+
| col1 | col2 | col3 |
+--------------------+
|  a1  |  a2  |  a3  |
|  b1  |  b2  |  b3  |
|  c1  |  c2  |  c3  |
+--------------------+

I tried the following but got an error.

from pyspark.sql.types import *

a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

rdd = sc.parallelize(a)

schema = StructType([
     StructField("a", StringType(), True),
     StructField("b", StringType(), True),
     StructField("c", StringType(), True)
     ])

df = sqlContext.createDataFrame(rdd, schema)

df.show()

The last show() statement fails with the error "Job aborted due to stage failure". Could someone tell me the solution? Thanks.


Solution

  • Based on your comment, I presume that you start with the rdd and not the list.

    I further assume that you are determining row order based on the index of the rdd. If these assumptions are correct, you can use zipWithIndex() to add a row number to each record.

    Then integer-divide the row number by 3, so that every 3 consecutive records share the same key. Next use groupByKey() to aggregate the records with the same key into a tuple. (Note that groupByKey() does not guarantee the order of values within a group; if the column order must be preserved in all cases, carry the index along and sort each group by it.)

    Finally, drop the key and call toDF().

    rdd.zipWithIndex()\
        .map(lambda row: (row[1]//3, row[0]))\
        .groupByKey()\
        .map(lambda row: tuple(row[1]))\
        .toDF(["a", "b", "c"])\
        .show()
    #+---+---+---+
    #|  a|  b|  c|
    #+---+---+---+
    #| a1| a2| a3|
    #| c1| c2| c3|
    #| b1| b2| b3|
    #+---+---+---+
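    If you actually start from the plain Python list rather than an RDD, a simpler sketch is to chunk the list in the driver and pass the rows to createDataFrame() directly, with no zipWithIndex()/groupByKey() needed. (The SparkSession setup below is an assumption for a self-contained example.)

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

    a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

    # Group every 3 consecutive items into one row tuple.
    rows = [tuple(a[i:i + 3]) for i in range(0, len(a), 3)]

    df = spark.createDataFrame(rows, ["a", "b", "c"])
    df.show()
    #+---+---+---+
    #|  a|  b|  c|
    #+---+---+---+
    #| a1| a2| a3|
    #| b1| b2| b3|
    #| c1| c2| c3|
    #+---+---+---+
    ```

    This also preserves row order deterministically, since the chunking happens on the driver before the data is distributed.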