I'd like to convert a linear list to a dataframe. i.e. given the following list,
a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]
Expected result is,
+--------------------+
| col1 | col2 | col3 |
+--------------------+
| a1 | a2 | a3 |
| b1 | b2 | b3 |
| c1 | c2 | c3 |
+--------------------+
I tried the following but got an error.
from pyspark.sql.types import *
a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]
rdd = sc.parallelize(a)
schema = StructType([
StructField("a", StringType(), True),
StructField("b", StringType(), True),
StructField("c", StringType(), True)
])
df = sqlContext.createDataFrame(rdd, schema)
df.show()
The last show() call fails with the error "Job aborted due to stage failure". Could someone tell me what I'm doing wrong? Thanks.
Based on your comment, I presume that you start with the rdd and not the list, and that you are determining row order by the rdd's index. (The error itself happens because your schema declares three columns, while each element of the rdd is a single string rather than a row of three values.) If those assumptions are correct, you can use zipWithIndex() to attach a row number to each record. Then integer-divide the row number by 3 so that every 3 consecutive records share the same key, and use groupByKey() to aggregate the records with the same key into a tuple. Finally, drop the key and call toDF(). Note that the row order of the result is not guaranteed:
(rdd.zipWithIndex()
    # key each record by its row group: index // 3, keeping (value, index)
    .map(lambda vi: (vi[1] // 3, vi))
    .groupByKey()
    # sort each group by the original index so the columns stay in order,
    # then drop the index and keep only the values
    .map(lambda kv: tuple(v for v, i in sorted(kv[1], key=lambda vi: vi[1])))
    .toDF(["a", "b", "c"])
    .show())
#+---+---+---+
#| a| b| c|
#+---+---+---+
#| a1| a2| a3|
#| c1| c2| c3|
#| b1| b2| b3|
#+---+---+---+
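If you actually do start from the plain Python list rather than an existing rdd, a simpler alternative (a sketch, assuming the same sqlContext from your question is available) is to slice the list into rows of 3 in plain Python and pass those rows to createDataFrame directly:

```python
a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]

# Slice the flat list into consecutive chunks of 3 -- one chunk per row
rows = [tuple(a[i:i + 3]) for i in range(0, len(a), 3)]
print(rows)  # [('a1', 'a2', 'a3'), ('b1', 'b2', 'b3'), ('c1', 'c2', 'c3')]

# No RDD or explicit schema needed:
# df = sqlContext.createDataFrame(rows, ["col1", "col2", "col3"])
# df.show()
```

This avoids the shuffle that groupByKey() triggers, and the row order is preserved.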