apache-spark-sql, hadoop2

I want to find the max value comparing 100 columns of a DataFrame


I have a DataFrame:

syr | P1    | P2
-----------------
1   | 200   | 300
2   | 500   | 700
3   | 900   | 400

I want to create another DataFrame which has the max value between P1 & P2. The expected output is:

syr | P1    | P2    | max
-------------------------
1   | 200   | 300   | 300
2   | 500   | 700   | 700
3   | 900   | 400   | 900

Solution

  • You could define a new UDF function to compute the max value between two columns, like:

    import org.apache.spark.sql.functions.udf

    def maxDef(p1: Int, p2: Int): Int = if (p1 > p2) p1 else p2
    val max = udf[Int, Int, Int](maxDef)
    

    And then apply the UDF in a withColumn() to define a new Column, like:

    val df1 = df.withColumn("max", max(df.col("P1"), df.col("P2")))
    +---+---+---+---+
    |syr| P1| P2|max|
    +---+---+---+---+
    |  1|200|300|300|
    |  2|500|700|700|
    |  3|900|400|900|
    +---+---+---+---+
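    If a UDF is not strictly required, the same two-column case could also be handled with Spark's built-in greatest function; a minimal sketch (df2 is just an illustrative name):

    import org.apache.spark.sql.functions.greatest

    // greatest() picks the largest value among the given columns, row by row
    val df2 = df.withColumn("max", greatest(df.col("P1"), df.col("P2")))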
    

    EDIT: Iterate through columns

    First initialize the max Column:

    import org.apache.spark.sql.functions.lit
    df = df.withColumn("max", lit(0))
    

    then, for each column you want (selected here by filtering on the column-name prefix), compare it with the max column. Note that df must be declared as a var, since it is reassigned inside the loop:

    df.columns.filter(_.startsWith("P")).foreach(col => {
      // keep the larger of the running max and the current column, using the max UDF defined above
      df = df.withColumn("max", max(df.col("max"), df.col(col)))
    })
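
    Putting the pieces together, a minimal self-contained sketch of the iteration approach (the SparkSession setup and the sample data below are assumptions made for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{lit, udf}

    // Assumed local session and the sample data from the question.
    val spark = SparkSession.builder().appName("max-of-columns").master("local[*]").getOrCreate()
    import spark.implicits._

    var df = Seq((1, 200, 300), (2, 500, 700), (3, 900, 400)).toDF("syr", "P1", "P2")

    // UDF returning the larger of two integers.
    def maxDef(p1: Int, p2: Int): Int = if (p1 > p2) p1 else p2
    val max = udf[Int, Int, Int](maxDef)

    // Initialize the running max, then fold every "P*" column into it.
    df = df.withColumn("max", lit(0))
    df.columns.filter(_.startsWith("P")).foreach(col => {
      df = df.withColumn("max", max(df.col("max"), df.col(col)))
    })

    df.show()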