Tags: scala, apache-spark, language-concepts

When to explicitly state types of inputs of functions?


I am able to calculate the mean word length per starting letter for the Spark collection

val animals23 = sc.parallelize(List(("a","ant"), ("c","crocodile"), ("c","cheetah"), ("c","cat"), ("d","dolphin"), ("d","dog"), ("g","gnu"), ("l","leopard"), ("l","lion"), ("s","spider"), ("t","tiger"), ("w","whale")), 2)

either with

animals23.
    aggregateByKey((0,0))(
        (x, y) => (x._1 + y.length, x._2 + 1),
        (x, y) => (x._1 + y._1, x._2 + y._2)
    ).
    map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).
    collect

or with

animals23.
    combineByKey(
        (x:String) => (x.length,1),
        (x:(Int, Int), y:String) => (x._1 + y.length, x._2 + 1),
        (x:(Int, Int), y:(Int, Int)) => (x._1 + y._1, x._2 + y._2)
    ).
    map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).
    collect

each resulting in

Array((a,3.0), (c,6.333333333333333), (d,5.0), (g,3.0), (l,5.5), (w,5.0), (s,6.0), (t,5.0))

What I do not understand: Why am I required to explicitly state the types in the functions in the second example while the first example's functions can do without?

I am talking about

(x, y) => (x._1 + y.length, x._2 + 1),
(x, y) => (x._1 + y._1, x._2 + y._2)

vs

(x:(Int, Int), y:String) => (x._1 + y.length, x._2 + 1),
(x:(Int, Int), y:(Int, Int)) => (x._1 + y._1, x._2 + y._2)

and it might be more a Scala than a Spark question.


Solution

  • Why am I required to explicitly state the types in the functions in the second example while the first example's functions can do without?

    Because in the first example, the compiler is able to infer the type of seqOp based on the first argument list supplied. aggregateByKey is using currying:

    def aggregateByKey[U](zeroValue: U)
                         (seqOp: (U, V) ⇒ U, 
                          combOp: (U, U) ⇒ U)
                         (implicit arg0: ClassTag[U]): RDD[(K, U)]
    

    Type inference in Scala works from one argument list to the next: the compiler infers the types in the second argument list based on the first. So in the first example, once it has seen the zero value (0, 0) it knows that seqOp is a function ((Int, Int), String) => (Int, Int); the same goes for combOp.
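    The same effect can be seen without Spark. The following is a minimal sketch (foldCurried is a made-up illustrative name, not a Spark API): because the zero value sits in its own argument list, the compiler has already fixed the type parameter before it looks at the lambda.

    ```scala
    // Curried version: the zero value in the first argument list fixes A,
    // so the lambda's parameter types in the second list are inferred.
    def foldCurried[A](zero: A)(seqOp: (A, Int) => A): A =
      List(1, 2, 3).foldLeft(zero)(seqOp)

    // No annotations needed: A = (Int, Int) is known from (0, 0).
    val sumAndCount = foldCurried((0, 0))((acc, n) => (acc._1 + n, acc._2 + 1))
    // sumAndCount == (6, 3): sum of elements and their count
    ```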

    By contrast, combineByKey has only a single argument list:

    def combineByKey[C](createCombiner: (V) ⇒ C, 
                        mergeValue: (C, V) ⇒ C, 
                        mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
    

    Without explicit annotations, the compiler cannot determine the type parameter C from any earlier argument list, so it has no basis for inferring the types of x and y.
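    Here is the same failure in a minimal non-Spark sketch (foldSingle is an illustrative name, not a Spark API), along with the two ways to make it compile:

    ```scala
    // Single argument list: A must be solved from all arguments at once.
    def foldSingle[A](zero: A, seqOp: (A, Int) => A): A =
      List(1, 2, 3).foldLeft(zero)(seqOp)

    // Does not compile -- the lambda's parameter types cannot be inferred:
    // foldSingle((0, 0), (acc, n) => (acc._1 + n, acc._2 + 1))

    // Works: annotate the lambda parameters explicitly...
    val annotated =
      foldSingle((0, 0), (acc: (Int, Int), n: Int) => (acc._1 + n, acc._2 + 1))

    // ...or supply the type argument, so the expected function type is known
    // before the lambda is type-checked.
    val explicitArg =
      foldSingle[(Int, Int)]((0, 0), (acc, n) => (acc._1 + n, acc._2 + 1))
    ```

    The second variant is exactly the trick applied to combineByKey below: giving the type argument up front lets the lambdas stay unannotated.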

    What you can do to help the compiler is to specify the type argument explicitly:

    animals23
      .combineByKey[(Int, Int)](x => (x.length,1), 
                               (x, y) => (x._1 + y.length, x._2 + 1),
                               (x, y) => (x._1 + y._1, x._2 + y._2))
      .map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble))
      .collect