apache-spark

New Dataframe columns from column of arrays


I have this DataFrame:

+---------+
|     data|
+---------+
|[a, b, c]|
|[d, e, f]|
|[g, h, i]|
+---------+

And a list of column names: ["first col", "second col", "third col"]

I want to turn each array element into its own column, named from that list, to produce the following DataFrame:

+-----------+-----------+----------+
|  first col| second col| third col|
+-----------+-----------+----------+
|          a|          b|         c|
|          d|          e|         f|
|          g|          h|         i|
+-----------+-----------+----------+

I'm scratching my head over how to do that. What would be the correct way to achieve this?


Solution

  • Untested code, but the idea is to use getItem() to access the i-th element of the data column, which in your case is an array, and store each element in a new column created with withColumn. The original data column is dropped at the end.

    
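    from pyspark.sql.functions import col

    # Assumes an active SparkSession named `spark`
    df = spark.createDataFrame([(['a', 'b', 'c'],), (['d', 'e', 'f'],), (['g', 'h', 'i'],)], ['data'])
    col_names = ['first col', 'second col', 'third col']

    # Add one column per name, taken from the matching array position
    for i, name in enumerate(col_names):
        df = df.withColumn(name, col('data').getItem(i))

    # Remove the original array column
    df = df.drop('data')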