Let's say I have a spark dataframe that includes the categorical columns (School, Type, Group)
------------------------------------------------------------
StudentID | School | Type | Group
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts
2 | ABC | Elementary | Football
3 | DEF | Secondary | Basketball-Cricket
4 | DEF | Secondary | Cricket
------------------------------------------------------------
I need to add one more column to the dataframe as below:
--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------
The extra column is combination of all categorical columns but includes a different processing on 'Group' column. The values of 'Group' column need to be split on '-'.
All the categorical columns including 'Group' are contained in a list. The 'Group' column is also input as a String as the column to be split on. The data-frame has other columns which are not used.
I am looking for the best performance solution.
If it's a simple array, it can be done with a single 'withColumn' transformation.
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))
However, here because of the additional processing in 'Group' column, the solution doesn't seem straightforward.
Using regex replacement to start of each field and to "-" in between:
val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map{
colName => regexp_replace(regexp_replace(df1(colName),"(^)",s"$colName: "),"(-)",s", $colName: ")
}:_*))