python · dataframe · apache-spark · pyspark · rdd

Convert a Pipeline RDD into a Spark dataframe


Starting from this:

items.take(2)
[['home', 'alone', 'apparently'], ['st','louis','plant','close','die','old','age','workers','making','cars','since','onset','mass','automotive','production','1920s']]

type(items)
pyspark.rdd.PipelinedRDD

I would like to convert it into a Spark DataFrame with a single column and one row per list of words.


Solution

  • You can create a DataFrame using toDF, but first wrap each list in another list, so that Spark treats the whole list of words as the value of a single column rather than as several separate columns.

    df = items.map(lambda x: [x]).toDF(['words'])
    
    df.show(truncate=False)
    +------------------------------------------------------------------------------------------------------------------+
    |words                                                                                                             |
    +------------------------------------------------------------------------------------------------------------------+
    |[home, alone, apparently]                                                                                         |
    |[st, louis, plant, close, die, old, age, workers, making, cars, since, onset, mass, automotive, production, 1920s]|
    +------------------------------------------------------------------------------------------------------------------+
    
    df.printSchema()
    root
     |-- words: array (nullable = true)
     |    |-- element: string (containsNull = true)