Tags: apache-spark, mapreduce, rdd, spark-graphx

Map each element of a list in Spark


I'm working with an RDD of pairs structured this way: (Int, List[Int]). My goal is to pair each item of the list with its key. So, for example, I need to do this:

RDD1:[Int, List[Int]]
<1><[2, 3]>
<2><[3, 5, 8]>

RDD2:[Int, Int]
<1><2>
<1><3>
<2><3>
<2><5>
<2><8>

I can't work out what kind of transformation would be needed to get to RDD2. The list of transformations can be found here. Any idea? Is my approach wrong?


Solution

  • You can use flatMap:

     val rdd1 = sc.parallelize(Seq((1, List(2, 3)), (2, List(3, 5, 8))))
     val rdd2 = rdd1.flatMap(x => x._2.map(y => (x._1, y)))
    
     // or:
     val rdd2 = rdd1.flatMap{case (key, list) => list.map(nr => (key, nr))}
    
     // print result:
     rdd2.collect().foreach(println)
    

    Gives result:

    (1,2)
    (1,3)
    (2,3)
    (2,5)
    (2,8)
    

    flatMap creates multiple output objects from one input object.

    In your case, the inner map inside flatMap maps a tuple (Int, List[Int]) to a List[(Int, Int)] - the key is the same as in the input tuple, but for each element of the input list it creates one output tuple. flatMap then makes each element of that List a separate row in the resulting RDD.
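    As a side note, Spark's pair RDDs also provide flatMapValues (via the implicit PairRDDFunctions), so the same result can be written as rdd1.flatMapValues(identity). The flattening itself behaves just like flatMap on plain Scala collections, which can be sketched without a SparkContext:

    ```scala
    // Same flattening logic on plain Scala collections;
    // RDD.flatMap applies the function to each element in the same way.
    object FlattenPairs {
      def main(args: Array[String]): Unit = {
        val pairs = Seq((1, List(2, 3)), (2, List(3, 5, 8)))
        // For each (key, list) pair, emit one (key, element) tuple per list element.
        val flattened = pairs.flatMap { case (key, list) => list.map(nr => (key, nr)) }
        // prints (1,2) (1,3) (2,3) (2,5) (2,8), one per line
        flattened.foreach(println)
      }
    }
    ```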