
How to create a struct column from a list of column names in Spark with Java?


I have a DataFrame with multiple columns, e.g.

root
 |-- playerName
 |-- country
 |-- bowlingAvg
 |-- bowlingSR
 |-- wickets
 |-- battingAvg
 |-- battingSR
 |-- runs

I also have a list of the column names that correspond to bowling stats:

List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingAvg", "bowlingSR", "wickets"));

Expected Schema:

root
 |-- playerName
 |-- country
 |-- bowlingAvg
 |-- bowlingSR
 |-- wickets
 |-- battingAvg
 |-- battingSR
 |-- runs
 |-- bowlingStats
 |    |-- bowlingAvg
 |    |-- bowlingSR
 |    |-- wickets

I can do it with hard-coded column names like this:

playerDF = playerDF.withColumn("bowlingStats", functions.struct("bowlingAvg", "bowlingSR", "wickets"));

However, I want to use the list to dynamically select the columns for the struct.

I know we can do it like this in Scala:

playerDF = playerDF.select(struct(bowlingParams.map(col): _*))

I have also found a reference on how to do this in Python.

Is there a way we can do this in Java with Spark?


Solution

  • For Java, this solution worked for me:

    • Remove one attribute from the list; it is passed to struct as a fixed (non-dynamic) argument.

    • Convert the remaining list to a Scala Seq using JavaConverters.

    • When creating the nested column, pass struct that one attribute (as a String) followed by the converted Scala Seq.

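       import scala.collection.JavaConverters;

       // "bowlingAvg" is passed to struct directly, so the list holds only the remaining names
       List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingSR", "wickets"));

       // Convert the Java list to a Scala Seq and pass it after the fixed first column name
       playerDF = playerDF.withColumn("bowlingStats",
               functions.struct("bowlingAvg", JavaConverters.asScalaIteratorConverter(bowlingParams.iterator()).asScala().toSeq()));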
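
The split into one fixed name plus the remaining names can also be made from the full list at run time instead of editing the list by hand. The block below is a minimal plain-Java sketch of that split, not Spark code. The class StructArgsSketch and its struct method are hypothetical stand-ins: they only mirror the (String, String...) call shape used in the question and return a description of the arguments, so the example runs without Spark on the classpath.

```java
import java.util.Arrays;
import java.util.List;

class StructArgsSketch {

    // Hypothetical stand-in mirroring the call shape struct(String, String...)
    // from the question. It only renders its arguments; it is NOT Spark's method.
    static String struct(String colName, String... colNames) {
        String rest = colNames.length == 0 ? "" : ", " + String.join(", ", colNames);
        return "struct(" + colName + rest + ")";
    }

    // Splits a non-empty list of column names into the first name and an
    // array of the remaining names, then forwards both to the varargs method.
    static String structFromList(List<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("at least one column name is required");
        }
        String first = names.get(0);
        // subList is a view, so the caller's list is left unchanged
        String[] rest = names.subList(1, names.size()).toArray(new String[0]);
        return struct(first, rest);
    }

    public static void main(String[] args) {
        List<String> bowlingParams = Arrays.asList("bowlingAvg", "bowlingSR", "wickets");
        System.out.println(structFromList(bowlingParams));
        // prints: struct(bowlingAvg, bowlingSR, wickets)
    }
}
```

With Spark available, the same first and rest values could presumably be passed to functions.struct(first, rest) in place of the Scala Seq conversion, since the question already calls struct with several String arguments from Java. That variant is an assumption here and was not tested against the answer's setup.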