
Spark scala get an array of type string from multiple columns


I am using spark with scala.

Imagine the input:

    +-----+------+----+
    |apple|orange|kiwi|
    +-----+------+----+
    |    1|     0|   1|
    |    0|     1|   1|
    |    1|     0|   0|
    +-----+------+----+

I would like to know how to get the following output, where the accumulator column is an array of strings (Array[String]):

    +-----+------+----+--------------+
    |apple|orange|kiwi|   accumulator|
    +-----+------+----+--------------+
    |    1|     0|   1| [apple, kiwi]|
    |    0|     1|   1|[orange, kiwi]|
    |    1|     0|   0|       [apple]|
    +-----+------+----+--------------+

In my real dataframe I have many more than 3 columns — several thousand.

How can I proceed in order to get my desired output?


Solution

  • You can use the array function over a mapped sequence of columns:

    import org.apache.spark.sql.functions.{array, col, udf, when}

    val tmp = array(df.columns.map(c => when(col(c) =!= 0, c)): _*)
    

    where

    when(col(c) =!= 0, c)
    

    returns the column name when the column's value is non-zero, and null otherwise.

    Then use a UDF to filter out the nulls:

    val dropNulls = udf((xs: Seq[String]) => xs.flatMap(Option(_)))
    df.withColumn("accumulator", dropNulls(tmp))
    
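The core of dropNulls is the `flatMap(Option(_))` idiom: wrapping each element in an Option turns null into None, and flatMap then discards the Nones. A minimal plain-Scala sketch of that behavior, outside Spark:

```scala
// Option(x) yields Some(x) for non-null x and None for null;
// flatMap unwraps the Somes and drops the Nones entirely.
val withNulls: Seq[String] = Seq("apple", null, "kiwi")
val cleaned: Seq[String] = withNulls.flatMap(Option(_))
println(cleaned) // List(apple, kiwi)
```

This is why the UDF's output arrays contain only the names of the non-zero columns.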

    So with example data:

    import spark.implicits._ // needed for toDF; already in scope in spark-shell

    val df = Seq((1, 0, 1), (0, 1, 1), (1, 0, 0)).toDF("apple", "orange", "kiwi")
    

    after adding the intermediate column with df.withColumn("tmp", tmp) you first get:

    +-----+------+----+--------------------+
    |apple|orange|kiwi|                 tmp|
    +-----+------+----+--------------------+
    |    1|     0|   1| [apple, null, kiwi]|
    |    0|     1|   1|[null, orange, kiwi]|
    |    1|     0|   0| [apple, null, null]|
    +-----+------+----+--------------------+
    

    and finally:

    +-----+------+----+--------------+
    |apple|orange|kiwi|   accumulator|
    +-----+------+----+--------------+
    |    1|     0|   1| [apple, kiwi]|
    |    0|     1|   1|[orange, kiwi]|
    |    1|     0|   0|       [apple]|
    +-----+------+----+--------------+
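As a side note, on Spark 2.4 or later you can skip the Scala UDF and drop the nulls with the built-in SQL higher-order function `filter`, which stays inside Catalyst rather than calling out to opaque JVM code. A sketch under that assumption (it needs a running SparkSession and the same df as above, so it is not verified here):

```scala
import org.apache.spark.sql.functions.{array, col, expr, when}

// Same array of (column name | null) expressions as before.
val tmp = array(df.columns.map(c => when(col(c) =!= 0, c)): _*)

// Spark 2.4+: the SQL higher-order function `filter` removes the
// nulls without a UDF; the temporary column is dropped afterwards.
val result = df
  .withColumn("tmp", tmp)
  .withColumn("accumulator", expr("filter(tmp, x -> x is not null)"))
  .drop("tmp")
```

The output is the same accumulator column as in the UDF version, but the whole expression remains optimizable by Spark.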