Tags: scala, apache-spark, rdd, distinct-values

Foreach on an RDD distinct does not work


I'm trying to concatenate all distinct values of a Spark RDD, separated by commas. This is my code:

def genPredicateIn(data: RDD[String], attribute: String): String = {
  var s: String = attribute + " in {"
  val distinct = data.distinct
  distinct.foreach(s += ", " + _)
  s += "}"
  s
}

But it returns just "attribute in {}". Why? What is my mistake?

It works if I write val array = data.distinct.collect and iterate over that. Why?


Solution

  • Running a similar example on PySpark, I get "lambda cannot contain assignment". Scala does compile your version, but the assignment inside foreach never reaches the variable on the driver.

    You should be able to collect the RDD, and then do the comma-join. That is essentially what you are doing anyway.

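    The assignment has no visible effect because Spark serializes the closure passed to foreach and ships it to the executors. Each task appends to its own copy of s, and those copies are never sent back to the driver, so the s in your function is unchanged when foreach returns. collect works because it brings the distinct values back to the driver, where the loop runs as ordinary local code.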
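
A minimal sketch of the collect-then-join approach described above, reusing the function name and signature from the question. It assumes the set of distinct values is small enough to fit in driver memory, and it uses mkString to add the separators, which also avoids the leading ", " that the original loop would have produced.

```scala
import org.apache.spark.rdd.RDD

// Collect the distinct values to the driver, then join them locally.
// Assumption: the distinct values fit comfortably in driver memory.
def genPredicateIn(data: RDD[String], attribute: String): String = {
  // distinct() runs on the cluster; collect() returns the result as a local Array
  val values: Array[String] = data.distinct().collect()
  // mkString(start, sep, end) builds e.g. "attr in {a, b}" without a leading comma
  values.mkString(attribute + " in {", ", ", "}")
}
```

The order of the collected values is not guaranteed, so if a stable string is needed, sorting the array before calling mkString (for example values.sorted) is one option.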