Search code examples
apache-sparkapache-spark-sql

spark dataframe: explode list column


I've got an output from Spark Aggregator which is List[Character]

case class Character(name: String, secondName: String, faculty: String)
val charColumn = HPAggregator.toColumn
val resultDF = someDF.select(charColumn)

So my dataframe looks like:

+-----------------------------------------------+
|               value                           |
+-----------------------------------------------+
|[[harry, potter, gryffindor],[ron, weasley ... |
+-----------------------------------------------+

Now I want to convert it to

+----------------------------------+
| name  | second_name | faculty    |
+----------------------------------+
| harry | potter      | gryffindor |
| ron   | weasley     | gryffindor |

How can I do that properly?


Solution

  • This can be done using Explode and Split Dataframe functions.

    Below is an example:

    >>> df = spark.createDataFrame([[[['a','b','c'], ['d','e','f'], ['g','h','i']]]],["col1"])
    >>> df.show(20, False)
    +---------------------------------------------------------------------+
    |col1                                                                 |
    +---------------------------------------------------------------------+
    |[WrappedArray(a, b, c), WrappedArray(d, e, f), WrappedArray(g, h, i)]|
    +---------------------------------------------------------------------+
    
    >>> from pyspark.sql.functions import explode
    >>> out_df = df.withColumn("col2", explode(df.col1)).drop('col1')
    >>>
    >>> out_df .show()
    +---------+
    |     col2|
    +---------+
    |[a, b, c]|
    |[d, e, f]|
    |[g, h, i]|
    +---------+
    
    >>> out_df.select(out_df.col2[0].alias('c1'), out_df.col2[1].alias('c2'), out_df.col2[2].alias('c3')).show()
    +---+---+---+
    | c1| c2| c3|
    +---+---+---+
    |  a|  b|  c|
    |  d|  e|  f|
    |  g|  h|  i|
    +---+---+---+
    
    >>>