
Error on fold when reducing in Spark / Scala


Given that I have a dataframe with some columns:

Why does this not work?

val output3b = input.withColumn("sum", columnsToConcat.foldLeft(0)((x,y)=>(x+y)))

notebook:16: error: overloaded method value + with alternatives:
 (x: Int)Int <and>
 (x: Char)Int <and>
 (x: Short)Int <and>
 (x: Byte)Int
cannot be applied to (org.apache.spark.sql.Column)
val output3b = input.withColumn("sum", columnsToConcat.foldLeft(0)((x,y)=>(x+y))) // does work
                                                                           ^
notebook:16: error: type mismatch;
found   : Int
required: org.apache.spark.sql.Column
val output3b = input.withColumn("sum", columnsToConcat.foldLeft(0)((x,y)=>(x+y)))  

But this does?

val output3a = input.withColumn("concat", columnsToConcat.foldLeft(lit(0))((x,y)=>(x+y)))

Using the famous lit function seems to smooth a few things, but I am not sure why.

+---+----+----+----+----+----+------+
| ID|var1|var2|var3|var4|var5|concat|
+---+----+----+----+----+----+------+
|  a|   5|   7|   9|  12|  13|  46.0|
+---+----+----+----+----+----+------+

Solution

  • Prerequisites:

    • Based on the compiler message and the API usage we can deduce that columnsToConcat is a Seq[o.a.s.sql.Column] or equivalent.

    • By convention, foldLeft takes a function whose result has the same type as the accumulator (the initial value). Here is the Seq.foldLeft signature:

       def foldLeft[B](z: B)(op: (B, A) ⇒ B): B 
      
    • + in Scala is a method: x + y is syntactic sugar for the call x.+(y).
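
The last two points can be sketched in plain Scala without Spark. This is only an illustration: the object name FoldLeftTypes and the sample words are made up for the example. It shows that the accumulator type B is fixed by the initial value and may differ from the element type A, and that + is an ordinary method call.

```scala
object FoldLeftTypes {
  // Elements are Strings (A = String).
  val words: Seq[String] = Seq("a", "bb", "ccc")

  // The initial value 0 fixes B = Int, so op must have type (Int, String) => Int.
  val totalLength: Int = words.foldLeft(0)((acc, w) => acc + w.length)

  // Infix + is sugar for a method call on the left operand.
  val viaMethod: Int = 1.+(2)
}
```

Here acc + w.length compiles because Int has a + method that takes an Int; the left operand decides which + is looked up.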

    This means that

    columnsToConcat.foldLeft(0)((x,y)=>(x+y))
    

    is equivalent to

    columnsToConcat.foldLeft(0)((x: Int, y: Column) => x + y)
    

    and you are asking for the + method of Int (the inferred type of the accumulator, 0). Int has no + method that accepts an org.apache.spark.sql.Column (the error message already lists the available overloads, and it is hardly surprising that no such method exists), nor is there, in the current scope, an implicit conversion from Int to a type that provides one. This is the first error. The second error follows from the same inference: withColumn requires a Column, but a fold seeded with 0 returns an Int.

    In the second case,

    columnsToConcat.foldLeft(lit(0))((x,y)=>(x+y))
    

    is equivalent to

    columnsToConcat.foldLeft(lit(0))((x: Column, y: Column) => x + y)
    

    and + refers to Column.+ (as the type of lit(0) is Column). Such a method exists: it accepts Any and returns Column. Since Column <: Any, the type constraints are satisfied, and the fold returns a Column, which is what withColumn expects.
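
The difference between the two folds can be reproduced without a Spark dependency. The sketch below is not Spark code: Col is a hypothetical stand-in for org.apache.spark.sql.Column that only records an expression as text, and the lit defined here is a hypothetical counterpart of Spark's lit. The one property it copies from the real API is that + accepts Any and returns the column type.

```scala
// Hypothetical stand-in for Column: it only records the expression as a string.
final case class Col(expr: String) {
  // Like Column.+, this accepts Any and returns the column type.
  def +(other: Any): Col = other match {
    case Col(e) => Col(s"($expr + $e)")
    case v      => Col(s"($expr + $v)")
  }
}

object FoldModel {
  // Hypothetical counterpart of lit: wraps a plain value in a Col.
  def lit(v: Any): Col = Col(v.toString)

  val columnsToConcat: Seq[Col] = Seq(Col("var1"), Col("var2"), Col("var3"))

  // The seed is a Col, so B = Col and op has type (Col, Col) => Col.
  val summed: Col = columnsToConcat.foldLeft(lit(0))((x, y) => x + y)

  // Does not compile: the seed 0 fixes B = Int, and Int has no + that takes a Col.
  // val broken = columnsToConcat.foldLeft(0)((x, y) => x + y)
}
```

With these assumptions, summed.expr is "(((0 + var1) + var2) + var3)": the fold builds one nested expression starting from the seed, in the same way the real fold builds a single Column expression for withColumn.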