Search code examples
scalaapache-sparknullcoalesce

Coalesce in spark scala


I am trying to understand if there is a default method available in Spark - scala to include empty strings in coalesce.

Ex- I have the below DF with me -

 val df2=Seq(
 ("","1"),
 ("null","15_20")
 )toDF("c1","c2")

+----+-----+
|  c1|   c2|
+----+-----+
|    |    1|
|null|15_20|
+----+-----+

The below code will work only for NULL values. But I require the coalesce to work for empty strings as well.

df2.withColumn("FirstNonNullOrBlank",coalesce(col("c1"),col("c2")))show
+----+-----+-------------------+
|  c1|   c2|FirstNonNullOrBlank|
+----+-----+-------------------+
|    |    1|                   |
|null|15_20|              15_20|
+----+-----+-------------------+

Expected Output -

+----+-----+-------------------+
|  c1|   c2|FirstNonNullOrBlank|
+----+-----+-------------------+
|    |    1|              1    |
|null|15_20|              15_20|
+----+-----+-------------------+

What should be the best approach here ?


Solution

  • you need a helper function to "nullify" these records :

    def nullify(c: Column) = when(not (c==="" or c==="null"),c)
    
    df2
      .withColumn("FirstNonNullOrBlank", coalesce(
        nullify(col("c1")),
        nullify(col("c2")))
      )
      .show
    
    +----+-----+-------------------+
    |  c1|   c2|FirstNonNullOrBlank|
    +----+-----+-------------------+
    |    |    1|                  1|
    |null|15_20|              15_20|
    +----+-----+-------------------+