
Spark: Flatten a Seq by reversing groupBy (i.e. repeat the key for each item in its Seq)


We have an RDD with the following form:

org.apache.spark.rdd.RDD[((BigInt, String), Seq[(BigInt, Int)])]

What we would like to do is flatten that into a single list of tab-delimited strings to save with saveAsTextFile. And by flatten, I mean repeat the groupBy key tuple (BigInt, String) for each item in its Seq.

So the data that looks like..

((x1,x2), ((y1.1,y1.2), (y2.1, y2.2) .... ))

... Will wind up looking like

x1   x2   y1.1  y1.2
x1   x2   y2.1  y2.2

So far, the code I've tried mostly flattens everything into a single line: "x1, x2, y1.1, y1.2, y2.1, y2.2", etc.

Any help would be appreciated, thanks in advance!


Solution

  • If you want to flatten the results of a groupByKey() operation so that both the key and value columns are flattened into one tuple, I recommend using flatMap:

    import org.apache.spark.rdd.RDD

    val grouped = sc.parallelize(Seq(((1, "two"), List((3, 4), (5, 6)))))
    val flattened: RDD[(Int, String, Int, Int)] = grouped.flatMap { case (key, groupValues) =>
      groupValues.map { value => (key._1, key._2, value._1, value._2) }
    }
    // flattened.collect() is Array((1,two,3,4), (1,two,5,6))
    

    From here, you can use additional transformations and actions to convert your combined tuple into a tab-separated string and to save the output.
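    For example, the tab-separation step could look like the following sketch. Plain Scala collections stand in for the RDD here so the logic is easy to follow; in Spark you would apply the same `map` to `flattened` and then call `saveAsTextFile` (the output path shown in the comment is hypothetical):

    ```scala
    // Sketch of converting each flattened tuple to a tab-separated line.
    // In Spark: flattened.map(...).saveAsTextFile("output/path")
    val rows = Seq((1, "two", 3, 4), (1, "two", 5, 6))
    val tsvLines = rows.map { case (a, b, c, d) =>
      Seq(a, b, c, d).mkString("\t") // join the fields with tab characters
    }
    // tsvLines: Seq("1\ttwo\t3\t4", "1\ttwo\t5\t6")
    ```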

    If you don't care about the flattened RDD containing Tuples, then you can write the more general

    val flattened: RDD[Array[Any]] = grouped.flatMap { case (key, groupValues) =>
      groupValues.map(value => (key.productIterator ++ value.productIterator).toArray)
    }
    // flattened.collect() is Array(Array(1, two, 3, 4), Array(1, two, 5, 6))
    

    Also, check out the flatMapValues transformation; if you have an RDD[(K, Seq[V])] and want RDD[(K, V)], then you can do flatMapValues(identity).
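    To illustrate what flatMapValues(identity) produces, here is a plain-Scala analogue (ordinary collections stand in for the RDD; the key is repeated for each element of its Seq):

    ```scala
    // Plain-Scala analogue of grouped.flatMapValues(identity):
    // each (key, Seq[V]) pair becomes one (key, V) pair per element.
    val grouped = Seq(((1, "two"), List((3, 4), (5, 6))))
    val pairs = grouped.flatMap { case (key, values) =>
      values.map(value => (key, value))
    }
    // pairs: Seq(((1,"two"),(3,4)), ((1,"two"),(5,6)))
    ```

    From there, one more map over the (key, value) pairs gives the repeated-header rows the question asks for.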