I am able to calculate the mean word length per starting letter for this Spark collection:
val animals23 = sc.parallelize(List(
  ("a", "ant"), ("c", "crocodile"), ("c", "cheetah"), ("c", "cat"),
  ("d", "dolphin"), ("d", "dog"), ("g", "gnu"), ("l", "leopard"),
  ("l", "lion"), ("s", "spider"), ("t", "tiger"), ("w", "whale")), 2)
either with
animals23.
  aggregateByKey((0, 0))(
    (x, y) => (x._1 + y.length, x._2 + 1), // seqOp
    (x, y) => (x._1 + y._1, x._2 + y._2)   // combOp
  ).
  map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).
  collect
or with
animals23.
  combineByKey(
    (x: String) => (x.length, 1),                                // createCombiner
    (x: (Int, Int), y: String) => (x._1 + y.length, x._2 + 1),   // mergeValue
    (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2) // mergeCombiners
  ).
  map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble)).
  collect
each resulting in
Array((a,3.0), (c,6.333333333333333), (d,5.0), (g,3.0), (l,5.5), (w,5.0), (s,6.0), (t,5.0))
What I do not understand: Why am I required to explicitly state the types in the functions in the second example while the first example's functions can do without?
I am talking about
  (x, y) => (x._1 + y.length, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
vs
  (x: (Int, Int), y: String) => (x._1 + y.length, x._2 + 1),
  (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
and it might be more a Scala than a Spark question.
Why am I required to explicitly state the types in the functions in the second example while the first example's functions can do without?
Because in the first example, the compiler can infer the type of seqOp from the first argument list supplied: aggregateByKey uses currying:
def aggregateByKey[U](zeroValue: U)
    (seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)
    (implicit arg0: ClassTag[U]): RDD[(K, U)]
Type inference in Scala proceeds one parameter list at a time: the compiler can infer the types in the second argument list from the first. Here it fixes U = (Int, Int) from zeroValue, so in the first example it knows that seqOp is a function ((Int, Int), String) => (Int, Int), and the same goes for combOp.
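The same effect can be seen outside Spark. Here is a minimal sketch with a hypothetical curried helper (foldTwo is not a Spark or standard-library method):

// Hypothetical curried helper: the first argument list fixes U.
def foldTwo[U](zero: U)(op: (U, Int) => U): U = op(op(zero, 1), 2)

// U = (Int, Int) is inferred from zero, so x and y need no annotations.
foldTwo((0, 0))((x, y) => (x._1 + y, x._2 + 1))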
By contrast, combineByKey has only a single argument list:
def combineByKey[C](createCombiner: (V) ⇒ C,
                    mergeValue: (C, V) ⇒ C,
                    mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
And without explicitly stated types, the compiler cannot determine C, so it doesn't know what to infer x and y to.
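An uncurried variant of the same hypothetical helper reproduces the problem (Scala 2 behavior):

// Hypothetical uncurried variant: zero and op share one argument list.
def foldTwoFlat[U](zero: U, op: (U, Int) => U): U = op(op(zero, 1), 2)

// Does not compile in Scala 2 ("missing parameter type"):
// foldTwoFlat((0, 0), (x, y) => (x._1 + y, x._2 + 1))

// Compiles once U is given explicitly:
foldTwoFlat[(Int, Int)]((0, 0), (x, y) => (x._1 + y, x._2 + 1))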
What you can do to help the compiler is to specify the type argument explicitly:
animals23
  .combineByKey[(Int, Int)](
    x => (x.length, 1),
    (x, y) => (x._1 + y.length, x._2 + 1),
    (x, y) => (x._1 + y._1, x._2 + y._2))
  .map(x => (x._1, x._2._1.toDouble / x._2._2.toDouble))
  .collect
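With C fixed to (Int, Int) up front, the parameter types of all three functions follow directly from combineByKey's signature, so none of the annotations from the question are needed.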