scala, hadoop, apache-spark, rdd

Saving to a custom output format in Spark / Hadoop


I have an RDD that contains multiple data structures, one of which is a Map[String, Int].

To make it easier to visualize, this is what I have after a map transformation:

val data = ... // This is a RDD[Map[String, Int]]

In one of the elements of this RDD, the Map contains the following:

*key value*
map_id -> 7753
Oscar -> 39
Jaden -> 13
Thomas -> 1
Chris -> 52

The other elements of the RDD contain other names and numbers, and each map has its own map_id. Anyhow, if I simply do data.saveAsTextFile(path), I get the following output in my file:

Map(map_id -> 7753, Oscar -> 39, Jaden -> 13, Thomas -> 1, Chris -> 52)
Map(...)
Map(...)
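
For reference, an RDD with this shape could be reproduced roughly like this (a minimal sketch assuming an existing SparkContext sc; the second element below is made up purely for illustration):

import org.apache.spark.rdd.RDD

// Minimal sketch of the shape of `data`; assumes an existing SparkContext `sc`
val data: RDD[Map[String, Int]] = sc.parallelize(Seq(
  Map("map_id" -> 7753, "Oscar" -> 39, "Jaden" -> 13, "Thomas" -> 1, "Chris" -> 52),
  Map("map_id" -> 7754, "Name" -> 1, "Name2" -> 2) // made-up second element
))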

However, I would like to format it as the following:

---------------------------
map_id: 7753
---------------------------
Oscar: 39
Jaden: 13
Thomas: 1
Chris: 52

---------------------------
map_id: <some other id>
---------------------------
Name: nbr
Name2: nbr2

Basically, I want the map_id as a kind of header, then the contents, one blank line, and then the next element.

Which brings me to my question: the data RDD only offers two save options, saveAsTextFile or saveAsObjectFile, and as far as I can see neither of them supports customizing the formatting. How could I go about doing this?


Solution

  • You can just map to String and write the result. For example:

    def format(map: Map[String, Int]): String = {
      // Pull out the id; fall back to "unknown" if the key is missing
      val id = map.get("map_id").map(_.toString).getOrElse("unknown")
      // Render every other entry as "key: value", one per line
      val content = map.collect {
        case (k, v) if k != "map_id" => s"$k: $v"
      }.mkString("\n")
      // The final "|" line ends each record with a newline,
      // which produces a blank line between records in the output file
      s"""|---------------------------
          |map_id: $id
          |---------------------------
          |$content
          |""".stripMargin
    }

    data.map(format).saveAsTextFile(path)
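
    As a quick sanity check (a hypothetical snippet reusing the sample map from the question; the order of the name lines may differ, since Map iteration order is not guaranteed):

    val sample = Map("map_id" -> 7753, "Oscar" -> 39, "Jaden" -> 13, "Thomas" -> 1, "Chris" -> 52)
    println(format(sample))

    Note that saveAsTextFile still writes one part file per partition; if you need a single output file, you can coalesce(1) (or repartition(1)) before saving.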