
How could a PySpark RDD linear list be converted to a DataFrame?


I'd like to convert a linear list to a dataframe. i.e. given the following list,

a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

Expected result is,

+--------------------+
| col1 | col2 | col3 |
+--------------------+
|  a1  |  a2  |  a3  |
|  b1  |  b2  |  b3  |
|  c1  |  c2  |  c3  |
+--------------------+

I tried the following but got an error.

from pyspark.sql.types import *

a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

rdd = sc.parallelize(a)

schema = StructType([
     StructField("a", StringType(), True),
     StructField("b", StringType(), True),
     StructField("c", StringType(), True)
     ])

df = sqlContext.createDataFrame(rdd, schema)

df.show()

The last show() statement fails with the error "Job aborted due to stage failure". Could someone tell me the solution? Thanks.


Solution

  • Based on your comment, I presume that you start with the rdd and not the list.

    I further assume that you are determining row order based on the index of the rdd. If these assumptions are correct, you can use zipWithIndex() to add a row number to each record.

    Then integer-divide the row number by 3, so that every 3 consecutive records share the same key. Next use groupByKey() to aggregate the records with the same key into a tuple. (Note that groupByKey() does not guarantee the order of values within a group; if the column order must be preserved in all cases, carry the index along and sort each group by it.)

    Finally, drop the key and call toDF().

    rdd.zipWithIndex()\
        .map(lambda row: (row[1]//3, row[0]))\
        .groupByKey()\
        .map(lambda row: tuple(row[1]))\
        .toDF(["a", "b", "c"])\
        .show()
    #+---+---+---+
    #|  a|  b|  c|
    #+---+---+---+
    #| a1| a2| a3|
    #| c1| c2| c3|
    #| b1| b2| b3|
    #+---+---+---+
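    If you actually start from the plain Python list rather than an RDD, a simpler sketch is to chunk the list in the driver and pass the rows to createDataFrame() directly, with no zipWithIndex()/groupByKey() needed. (The SparkSession setup below is an assumption for a self-contained example.)

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()

    a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

    # Group every 3 consecutive items into one row tuple.
    rows = [tuple(a[i:i + 3]) for i in range(0, len(a), 3)]

    df = spark.createDataFrame(rows, ["a", "b", "c"])
    df.show()
    #+---+---+---+
    #|  a|  b|  c|
    #+---+---+---+
    #| a1| a2| a3|
    #| b1| b2| b3|
    #| c1| c2| c3|
    #+---+---+---+
    ```

    This also preserves row order deterministically, since the chunking happens on the driver before the data is distributed.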