We have an RDD with the following form:
org.apache.spark.rdd.RDD[((BigInt, String), Seq[(BigInt, Int)])]
What we would like to do is flatten that into a single list of tab-delimited strings to save with saveAsTextFile. By "flatten", I mean repeating the group-by key tuple (BigInt, String) for each item in its Seq.
So the data that looks like...
((x1,x2), ((y1.1,y1.2), (y2.1, y2.2) .... ))
...will wind up looking like:
x1 x2 y1.1 y1.2
x1 x2 y2.1 y2.2
So far, the code I've tried mostly flattens everything into a single line: "x1, x2, y1.1, y1.2, y2.1, y2.2 ..." etc.
Any help would be appreciated, thanks in advance!
If you want to flatten the results of a groupByKey() operation so that both the key and value columns are flattened into one tuple, I recommend using flatMap:
import org.apache.spark.rdd.RDD

val grouped = sc.parallelize(Seq(((1, "two"), List((3, 4), (5, 6)))))
val flattened: RDD[(Int, String, Int, Int)] = grouped.flatMap { case (key, groupValues) =>
  groupValues.map { value => (key._1, key._2, value._1, value._2) }
}
// flattened.collect() is Array((1,two,3,4), (1,two,5,6))
From here, you can use additional transformations and actions to convert your combined tuple into a tab-separated string and to save the output.
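For instance, the conversion to tab-separated strings is a single map. Here's a minimal sketch using plain Scala collections as a stand-in for the RDD; the same .map applies unchanged to the real RDD, followed by saveAsTextFile on the result:

```scala
// Sketch: join each flattened tuple's fields with tab characters.
// Plain Scala collections stand in for the RDD here; on the real RDD the
// identical .map works, followed by lines.saveAsTextFile(outputPath).
val flattened = Seq((1, "two", 3, 4), (1, "two", 5, 6))
val lines = flattened.map { case (a, b, c, d) => s"$a\t$b\t$c\t$d" }
// lines == Seq("1\ttwo\t3\t4", "1\ttwo\t5\t6")
```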
If you don't care about the flattened RDD containing tuples, then you can write the more general
val flattened: RDD[Array[Any]] = grouped.flatMap { case (key, groupValues) =>
  groupValues.map(value => (key.productIterator ++ value.productIterator).toArray)
}
// flattened.collect() is Array(Array(1, two, 3, 4), Array(1, two, 5, 6))
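With the Array[Any] form, producing the tab-separated lines is even simpler via mkString. Again a sketch with plain collections standing in for the RDD; the same .map applies to the real one:

```scala
// Sketch: mkString joins the heterogeneous fields with a tab separator,
// so no pattern match on the tuple arity is needed.
val flattened = Seq(Array[Any](1, "two", 3, 4), Array[Any](1, "two", 5, 6))
val lines = flattened.map(_.mkString("\t"))
// lines == Seq("1\ttwo\t3\t4", "1\ttwo\t5\t6")
```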
Also, check out the flatMapValues transformation; if you have an RDD[(K, Seq[V])] and want an RDD[(K, V)], then you can do flatMapValues(identity).
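What flatMapValues(identity) computes can be sketched with plain collections (on Spark, flatMapValues is provided for pair RDDs via PairRDDFunctions):

```scala
// Sketch: the effect of rdd.flatMapValues(identity) on plain collections.
// Each (key, Seq(values)) pair expands into one (key, value) pair per element,
// with the key repeated.
val grouped = Seq((1, Seq("a", "b")), (2, Seq("c")))
val flat = grouped.flatMap { case (k, vs) => vs.map(v => (k, v)) }
// flat == Seq((1, "a"), (1, "b"), (2, "c"))
```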