Tags: scala, apache-spark, apache-spark-sql, rdd

Create an array column from other columns after processing the column values


Let's say I have a Spark DataFrame that includes the categorical columns (School, Type, Group):

------------------------------------------------------------
StudentID  |  School |   Type        |  Group               
------------------------------------------------------------
1          |  ABC    |   Elementary  |  Music-Arts          
2          |  ABC    |   Elementary  |  Football            
3          |  DEF    |   Secondary   |  Basketball-Cricket  
4          |  DEF    |   Secondary   |  Cricket             
------------------------------------------------------------

I need to add one more column to the dataframe as below:

---------------------------------------------------------------------------------------
StudentID  |  School |   Type        |  Group               |  Combined Array
---------------------------------------------------------------------------------------
1          |  ABC    |   Elementary  |  Music-Arts          |  ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2          |  ABC    |   Elementary  |  Football            |  ["School: ABC", "Type: Elementary", "Group: Football"]
3          |  DEF    |   Secondary   |  Basketball-Cricket  |  ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4          |  DEF    |   Secondary   |  Cricket             |  ["School: DEF", "Type: Secondary", "Group: Cricket"]
---------------------------------------------------------------------------------------

The extra column is a combination of all the categorical columns, but with different processing for the 'Group' column: its values need to be split on '-'.

All the categorical columns, including 'Group', are given in a list. The name of the column to be split on ('Group') is also passed in as a String. The DataFrame has other columns which are not used.

I am looking for the best-performing solution.

If it were a simple array with no per-column processing, it could be done with a single 'withColumn' transformation:

import org.apache.spark.sql.functions.array

val columns = List("School", "Type", "Group")
val df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)): _*))
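
For the first row, this yields ["ABC", "Elementary", "Music-Arts"]: the values carry no column-name prefix, and 'Group' is not split.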

However, because of the additional processing required for the 'Group' column, the solution here doesn't seem straightforward.


Solution

  • Using regexp_replace to prefix each field with its column name and to rewrite each "-" in between:

    import org.apache.spark.sql.functions.{array, regexp_replace}

    val df1 = spark.read.option("header", "true").csv(filePath)
    val columns = List("School", "Type", "Group")
    val df2 = df1.withColumn("CombinedArray", array(columns.map { colName =>
      // prefix the value with "<column>: ", then turn each "-" into ", <column>: "
      regexp_replace(regexp_replace(df1(colName), "^", s"$colName: "), "-", s", $colName: ")
    }: _*))
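
Note that this keeps each column as a single array element: for StudentID 3 the last entry is the one string "Group: Basketball, Group: Cricket" rather than two separate entries. If the expected output truly requires one array element per split value, the 'Group' values can be split and the resulting nested arrays flattened. A minimal sketch, assuming Spark 3.0+ for the Column-based transform (on Spark 2.4 the same lambda could be written inside expr), with the split column name passed in as splitColumn per the question:

    import org.apache.spark.sql.functions.{array, concat, flatten, lit, split, transform}

    val columns = List("School", "Type", "Group")
    val splitColumn = "Group"  // the column whose values are split on "-"

    val df2 = df1.withColumn("CombinedArray", flatten(array(columns.map { colName =>
      if (colName == splitColumn)
        // "Basketball-Cricket" -> ["Group: Basketball", "Group: Cricket"]
        transform(split(df1(colName), "-"), v => concat(lit(s"$colName: "), v))
      else
        // wrap the single prefixed value so every element of the outer array is itself an array
        array(concat(lit(s"$colName: "), df1(colName)))
    }: _*)))

For StudentID 3 this produces ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"], matching the expected output, and it stays within a single 'withColumn' transformation per row.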