scala, apache-spark, hive, apache-spark-sql

Iterate through Columns of a Spark DataFrame and update specified values


To iterate through the columns of a Spark DataFrame created from a Hive table and update all occurrences of certain values in the desired columns, I tried the following code.

import org.apache.spark.sql.{DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf

val a: DataFrame = spark.sql(s"select * from default.table_a")

    val column_names: Array[String] = a.columns

    val required_columns: Array[String] = column_names.filter(name => name.endsWith("_date")) 

    val func = udf((value: String) => { if (value == "XXXX" || value == "WWWW" || value == "TTTT") "NULL" else value } )

    val b = {for (column: String <- required_columns) { a.withColumn(column , func(a(column))) } a}

When I executed the code in spark-shell, I got the following error.

scala> val b = {for (column: String <- required_columns) { a.withColumn(column , func(a(column))) } a}
<console>:35: error: value a is not a member of org.apache.spark.sql.DataFrame
       val b = {for (column: String <- required_column_list) { a.withColumn(column , isNull(a(column))) } a}
                                                                                                          ^ 

I also tried the following statement, but didn't get the required output.

val b = for (column: String <- required_columns) { a.withColumn(column , func(a(column))) }

The variable b is created as Unit instead of a DataFrame.

scala> val b = for (column: String <- required_columns) { a.withColumn(column , func(a(column))) }
    b: Unit = ()
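
From what I can tell (a minimal sketch below, reusing the a, func and required_columns defined above), a Scala for loop without yield evaluates to Unit, and every withColumn call returns a new DataFrame that is immediately discarded, since a itself is never reassigned:

    // for without yield evaluates to Unit; each new DataFrame is thrown away
    val unitResult: Unit = for (column <- required_columns) {
      a.withColumn(column, func(a(column)))   // result discarded
    }

    // even with yield, this gives one DataFrame per column, each with only a
    // single column transformed, not one DataFrame with all columns transformed
    val partials: Array[DataFrame] =
      for (column <- required_columns) yield a.withColumn(column, func(a(column)))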

Please suggest a better way to iterate through the columns of a DataFrame and update all occurrences of the values in those columns, or correct me where I am wrong. Any other solution is also appreciated. Thanks in advance.


Solution

  • Instead of a for loop, you should go with foldLeft. And you don't need a udf function when an inbuilt function can be used.

    val column_names: Array[String] = a.columns
    
    val required_columns: Array[String] = column_names.filter(name => name.endsWith("_date"))
    
    import org.apache.spark.sql.functions._
    val b = required_columns.foldLeft(a) { (tempdf, colName) =>
      tempdf.withColumn(colName,
        when(col(colName) === "XXXX" || col(colName) === "WWWW" || col(colName) === "TTTT", "NULL")
          .otherwise(col(colName)))
    }
    

    I hope the answer is helpful

    Explanation:

    In
    required_columns.foldLeft(a){(tempdf, colName) => tempdf.withColumn(colName, when(col(colName) === "XXXX" || col(colName) === "WWWW" || col(colName) === "TTTT", "NULL").otherwise(col(colName)))}

    required_columns is an array of the column names from the dataframe/dataset a that end with _date; these are the colName values passed to withColumn

    tempdf starts out as the original dataframe/dataset a and, on each step, becomes the dataframe returned by the previous withColumn call

    the when function applied inside withColumn replaces every XXXX, WWWW or TTTT value with the string NULL

    finally, foldLeft returns the dataframe with all the transformations applied, which is assigned to b
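
    As a quick sanity check, below is a minimal sketch of the same foldLeft pattern on a tiny hand-made dataframe (the column name start_date and the sample values are assumptions for illustration only, not taken from default.table_a):

    import org.apache.spark.sql.functions.{col, when}
    import spark.implicits._   // for toDF; spark is the SparkSession (already available in spark-shell)

    // hypothetical sample data with a single "_date" column
    val sample = Seq("XXXX", "WWWW", "TTTT", "2019-01-01").toDF("start_date")

    val cleaned = Seq("start_date").foldLeft(sample) { (tempdf, colName) =>
      tempdf.withColumn(colName,
        when(col(colName) === "XXXX" || col(colName) === "WWWW" || col(colName) === "TTTT", "NULL")
          .otherwise(col(colName)))
    }

    cleaned.show()
    // every XXXX/WWWW/TTTT row now contains the string NULL; 2019-01-01 is unchanged

    Note that, like the question, this writes the literal string "NULL" into the column; if an actual SQL null is wanted instead, the second argument of when could be lit(null) rather than "NULL".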