Error while converting pipelined RDD to Dataframe in pyspark

I am trying to convert the pipelined RDD below into a DataFrame.

Pipelined RDD -> user_rdd

['new_user1',
 'new_user2',
 'Onlyknows',
 'Icetea',
 '_coldcoffee_']

I tried to convert it using the code below:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField('Username', StringType(), True)])
user_df = sqlContext.createDataFrame(user_rdd, schema)
user_df.show(20)

I am getting the below error:

ValueError: Unexpected tuple 'new_user1' with StructType

I tried using toDF() also:

user_df=user_rdd.toDF()

This time the error encountered is:

TypeError: Can not infer schema for type: <type 'str'>

Let me know if there is a way to convert this to a DataFrame using pyspark.

Solution

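The RDD you have holds plain strings, which is essentially 1-D data, while a DataFrame requires 2-D data: every element has to be a row-like record (such as a tuple) with one field per column in the schema. Converting each element of the RDD to a one-element tuple should resolve the issue: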

user_df = sqlContext.createDataFrame(user_rdd.map(lambda x: (x,)), schema)
#                                             ^^^^^^^^^^^^^^^^^^^  
user_df.show()
+------------+
|    Username|
+------------+
|   new_user1|
|   new_user2|
|   Onlyknows|
|      Icetea|
|_coldcoffee_|
+------------+
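
For reference, the same fix can be written as a small self-contained sketch. The helper names `to_row` and `build_user_df` are hypothetical and introduced only for illustration, and the sketch assumes Spark 2.0 or later, where a `SparkSession` takes the place of the `sqlContext` used above:

```python
def to_row(value):
    """Wrap a bare value in a one-element tuple so it reads as a one-column row."""
    return (value,)


def build_user_df(spark, usernames):
    """Build a single-column DataFrame from a list of strings.

    `spark` is assumed to be an existing SparkSession (an assumption,
    the question itself uses sqlContext).
    """
    # Imported here so that to_row stays usable without Spark installed.
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("Username", StringType(), True)])
    user_rdd = spark.sparkContext.parallelize(usernames)
    # Each string becomes ('name',), which matches the one-field schema.
    return spark.createDataFrame(user_rdd.map(to_row), schema)
```

With a session obtained from `SparkSession.builder.getOrCreate()`, calling `build_user_df(spark, ['new_user1', 'Icetea']).show()` should print a table with a single Username column. The `toDF()` route from the question should work the same way once the elements are tuples, for example `user_rdd.map(to_row).toDF(['Username'])`.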