Tags: python, apache-spark, dataframe, pyspark, rdd

Error while converting pipelined RDD to Dataframe in pyspark


I am trying to convert the pipelined RDD below into a DataFrame.

Contents of the pipelined RDD, user_rdd:

['new_user1',
 'new_user2',
 'Onlyknows',
 'Icetea',
 '_coldcoffee_']

I tried to convert it with the code below:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField('Username', StringType(), True)])
user_df = sqlContext.createDataFrame(user_rdd, schema)
user_df.show(20)

I am getting the below error:

ValueError: Unexpected tuple 'new_user1' with StructType

I tried using toDF() also:

user_df = user_rdd.toDF()

This time the error encountered is:

TypeError: Can not infer schema for type: <type 'str'>

Is there a way to convert this RDD to a DataFrame using PySpark?


Solution

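  • The RDD you have contains plain strings, which is essentially 1-D data, whereas a DataFrame expects 2-D data: each record must be a row (for example a tuple, list, or Row) with one value per column. Converting each element of the RDD to a one-element tuple should resolve the issue: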

    user_df = sqlContext.createDataFrame(user_rdd.map(lambda x: (x,)), schema)
    #                                             ^^^^^^^^^^^^^^^^^^^  
    user_df.show()
    +------------+
    |    Username|
    +------------+
    |   new_user1|
    |   new_user2|
    |   Onlyknows|
    |      Icetea|
    |_coldcoffee_|
    +------------+
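
The fix depends on the trailing comma: `(x,)` is a one-element tuple, while `(x)` is just `x`. The sketch below illustrates that, along with a few other ways to get the same one-column DataFrame. The helper names `as_single_column_rows` and `users_to_df` are made up for illustration, and `spark` is assumed to be an existing SparkSession (Spark 2.0+) rather than the `sqlContext` used above. Treat it as a rough sketch, not the only way to do this.

```python
def as_single_column_rows(values):
    """Wrap each scalar in a 1-tuple so every record has exactly one field."""
    return [(v,) for v in values]


def users_to_df(spark, user_rdd):
    """Build a one-column DataFrame from an RDD of strings in three ways.

    Assumption: `spark` is an already created SparkSession.
    """
    # Imported lazily so the helper above also works without pyspark installed.
    from pyspark.sql import Row
    from pyspark.sql.types import StringType

    # 1. Map to 1-tuples and name the column while converting.
    df_named = user_rdd.map(lambda x: (x,)).toDF(["Username"])

    # 2. Use Row objects so the field name travels with each record.
    df_rows = spark.createDataFrame(user_rdd.map(lambda x: Row(Username=x)))

    # 3. Pass an atomic type as the schema; Spark names the column "value",
    #    so rename it afterwards.
    df_value = (
        spark.createDataFrame(user_rdd, StringType())
        .withColumnRenamed("value", "Username")
    )
    return df_named, df_rows, df_value


# Plain-Python illustration of the record shape Spark expects.
users = ["new_user1", "new_user2", "Onlyknows", "Icetea", "_coldcoffee_"]
rows = as_single_column_rows(users)
print(rows[0])        # a 1-tuple: ('new_user1',)
print(("new_user1"))  # parentheses alone do not make a tuple: new_user1
```

With a running session, `users_to_df(spark, user_rdd)` should return three DataFrames that each have a single string column named Username.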