Tags: scala, list, apache-spark, rdd

Spark: RDD to List


I have an RDD of type

RDD[(String, String)]

and I want to create two Lists (one for each component of the pairs).

I tried to use rdd.foreach() to fill two ListBuffers and then convert them to Lists, but I guess each node creates its own copy of the ListBuffers, because after the iteration the ListBuffers are empty. How can I do it?

EDIT : my approach

val labeled = data_labeled.map { line =>
  val parts = line.split(',')
  (parts(5), parts(7))
}.cache()

import scala.collection.mutable.ListBuffer

var testList: ListBuffer[String] = new ListBuffer()

labeled.foreach(line =>
  testList += line._1
)
val labeledList = testList.toList
println("rdd: " + labeled.count)
println("bufferList: " + testList.size)
println("list: " + labeledList.size)

and the result is:

rdd: 31990654
bufferList: 0
list: 0

Solution

  • If you really want to create two Lists - meaning, you want all the distributed data to be collected into the driver application (risking slowness or an OutOfMemoryError) - you can use collect and then apply simple map operations to the result:

    val list: List[(String, String)] = rdd.collect().toList
    val col1: List[String] = list.map(_._1)
    val col2: List[String] = list.map(_._2)
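As a side note, the two map passes over the collected list can be replaced by a single `unzip`, which splits a list of pairs into two lists in one traversal (the sample data below is illustrative, standing in for the result of `rdd.collect().toList`):

```scala
// Stand-in for rdd.collect().toList (illustrative data)
val list: List[(String, String)] = List(("a", "1"), ("b", "2"), ("c", "3"))

// unzip splits a list of pairs into two lists in a single pass
val (col1, col2) = list.unzip
// col1 == List("a", "b", "c")
// col2 == List("1", "2", "3")
```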
    

    Alternatively - if you want to "split" your RDD into two RDDs - it's pretty similar without collecting the data:

    rdd.cache() // make sure the RDD isn't recomputed once per map below
    val rdd1: RDD[String] = rdd.map(_._1)
    val rdd2: RDD[String] = rdd.map(_._2)
    

    A third alternative is to first map into these two RDDs and then collect each one of them, but it's not much different from the first option and suffers from the same risks and limitations.
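That third alternative can be sketched with plain Scala collections standing in for the RDD, since a List's map mirrors RDD.map (the data and names here are illustrative; with a real RDD the projections would be `rdd.map(_._1).collect().toList` and `rdd.map(_._2).collect().toList`):

```scala
// Plain-Scala stand-in for RDD[(String, String)] (illustrative data)
val pairs: List[(String, String)] = List(("a", "1"), ("b", "2"))

// Map into the two projections first, then "collect" each one --
// with a real RDD, each map would run distributed and each collect
// would pull one column's worth of data into the driver
val col1: List[String] = pairs.map(_._1)
val col2: List[String] = pairs.map(_._2)
// col1 == List("a", "b"); col2 == List("1", "2")
```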