I am trying to convert the pipelined RDD below into a DataFrame.
Pipelined RDD -> user_rdd
['new_user1',
'new_user2',
'Onlyknows',
'Icetea',
'_coldcoffee_']
I tried converting it with the code below:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField('Username', StringType(), True)])
user_df = sqlContext.createDataFrame(user_rdd, schema)
user_df.show(20)
I am getting the below error:
ValueError: Unexpected tuple 'new_user1' with StructType
I tried using toDF() also:
user_df=user_rdd.toDF()
This time the error encountered is:
TypeError: Can not infer schema for type: <type 'str'>
Is there a way to convert this RDD to a DataFrame using PySpark?
The RDD you have is a list of strings, which is essentially 1-D data, whereas a DataFrame requires 2-D data: each element must be a row, even if that row has only one column. Converting each element of the RDD to a one-element tuple should resolve the issue:
user_df = sqlContext.createDataFrame(user_rdd.map(lambda x: (x,)), schema)
# The .map(lambda x: (x,)) call is the change: it wraps each string in a 1-tuple.
user_df.show()
+------------+
|    Username|
+------------+
|   new_user1|
|   new_user2|
|   Onlyknows|
|      Icetea|
|_coldcoffee_|
+------------+
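To see the shape change in isolation, here is a minimal sketch in plain Python. The helper names `to_rows` and `build_user_df` are hypothetical (they are not part of PySpark), and `build_user_df` assumes you already have a SparkSession to pass in, which is an assumption beyond the `sqlContext` used above.

```python
def to_rows(names):
    """Wrap each string in a 1-tuple so it lines up with a one-field schema."""
    return [(name,) for name in names]


def build_user_df(spark, names):
    """Build a single-column DataFrame; `spark` is assumed to be a SparkSession."""
    # Imported lazily so to_rows() can be tried without Spark installed.
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField('Username', StringType(), True)])
    # Each row is now a tuple of length 1, matching the single schema field.
    return spark.createDataFrame(to_rows(names), schema)


rows = to_rows(['new_user1', 'Icetea'])
print(rows)  # [('new_user1',), ('Icetea',)]
```

On an RDD, `user_rdd.map(lambda x: (x,))` performs the same per-element wrapping that `to_rows` does on a list.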