
Create Tuple out of Array(Array[String]) of Varying Sizes using Scala


I am new to Scala and I am trying to make tuple pairs out of an RDD of type Array(Array[String]) that looks like:

(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)

I want the first element of each array to be the key and every other element of that array to be a value paired with it. For example, the output would look like:

122abc    223cde
122abc    334vbn
122abc    445das
221bca    321dsa
231dsa    653asd
231dsa    698poq
231dsa    897qwa

I can't figure out how to separate the first element from each array and then map it to every other element.


Solution

  • If I'm reading it correctly, the core of your question is how to separate the head (first element) of each inner array from the tail (remaining elements), which you can do with the head and tail methods. RDDs behave a lot like Scala lists, so you can do this all with what looks like pure Scala code.
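
To see the head/tail split in isolation, here is a minimal sketch on a single plain Scala array, with no Spark involved. The object name HeadTailSketch is made up for illustration only.

```scala
object HeadTailSketch {
  val arr: Array[String] = Array("122abc", "223cde", "334vbn", "445das")

  // head is the first element; it becomes the key.
  val key: String = arr.head

  // tail is everything after the first element; these become the values.
  val values: Array[String] = arr.tail

  // Pair the key with each remaining element.
  val pairs: Array[(String, String)] = values.map(value => key -> value)

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```

The same head/tail expression is what gets applied to every inner array in the RDD.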

    Given the following input RDD:

    import org.apache.spark.rdd.RDD

    // Assumes `sc` is an existing SparkContext, as in spark-shell.
    val input: RDD[Array[Array[String]]] = sc.parallelize(
      Seq(
        Array(
          Array("122abc","223cde","334vbn","445das"),
          Array("221bca","321dsa"),
          Array("231dsa","653asd","698poq","897qwa")
        )
      )
    )
    

    The following should do what you want:

    val output: RDD[(String,String)] =
      input.flatMap { arrArrStr: Array[Array[String]] =>
        arrArrStr.flatMap { arrStrs: Array[String] =>
          arrStrs.tail.map { value => arrStrs.head -> value }
        }
      }
    

    And in fact, because of how the flatMap and map calls are nested, you could rewrite it as a for-comprehension:

    val output: RDD[(String,String)] =
      for {
        arrArrStr: Array[Array[String]] <- input
        arrStr: Array[String] <- arrArrStr
        str: String <- arrStr.tail
      } yield (arrStr.head -> str)
    

    Which one you go with is ultimately a matter of personal preference (though in this case, I prefer the latter, as you don't have to indent code as much).

    For verification:

    output.collect().foreach(println)
    

    Should print out:

    (122abc,223cde)
    (122abc,334vbn)
    (122abc,445das)
    (221bca,321dsa)
    (231dsa,653asd)
    (231dsa,698poq)
    (231dsa,897qwa)
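
One caveat: head and tail both throw an exception on an empty array, so the code above assumes every inner array has at least one element. If that is not guaranteed in your data, a guarded variant might look like the sketch below. It uses plain Scala collections rather than an RDD so it can run without Spark, and the names SafePairsSketch and toPairs are hypothetical, not part of any library.

```scala
object SafePairsSketch {
  // Hypothetical helper: pairs the first element with every later element.
  // Empty and single-element arrays yield no pairs instead of throwing.
  def toPairs(arr: Array[String]): Seq[(String, String)] =
    arr.headOption match {
      case Some(key) => arr.drop(1).map(value => key -> value).toSeq
      case None      => Seq.empty
    }

  // Sample rows, including an empty array and a key with no values.
  val rows: Seq[Array[String]] = Seq(
    Array("122abc", "223cde", "334vbn", "445das"),
    Array.empty[String],
    Array("221bca", "321dsa"),
    Array("999zzz")
  )

  val pairs: Seq[(String, String)] = rows.flatMap(toPairs)

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```

Here headOption and drop(1) stand in for head and tail because neither throws on an empty array. In the RDD version, the same match expression could presumably take the place of the tail.map expression.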