Tags: scala, apache-spark, rdd

In Scala, how would I take a Spark RDD, and output to different files, grouped by the values of a column?


If there was a Spark RDD like this:

id  | data
----------
1   | "a"
1   | "b"
2   | "c"
3   | "d"

How could I output this to separate JSON text files, grouped by id, such that part-0000-1.json would contain rows "a" and "b", part-0000-2.json contains "c", etc.?


Solution

  • df.write.partitionBy("col").json(<path_to_file>)
    

    is what you need. Note that partitionBy is part of the DataFrame writer API, so convert the RDD to a DataFrame first (e.g. rdd.toDF("id", "data")). Also, Spark writes one subdirectory per distinct value (id=1/, id=2/, ...), each containing part-* JSON files, rather than single files named part-0000-1.json.
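
A minimal end-to-end sketch of this approach, using the example data from the question (the output path "/tmp/out" and the app/object names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object PartitionedJsonWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-json-write")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build the example RDD, then convert it to a DataFrame,
    // since partitionBy lives on the DataFrameWriter, not the RDD API.
    val rdd = spark.sparkContext.parallelize(Seq(
      (1, "a"), (1, "b"), (2, "c"), (3, "d")
    ))
    val df = rdd.toDF("id", "data")

    // Writes one subdirectory per distinct id, e.g.
    //   /tmp/out/id=1/part-*.json  (rows "a" and "b")
    //   /tmp/out/id=2/part-*.json  (row "c")
    // The partition column is encoded in the directory name,
    // so the JSON rows themselves contain only the "data" field.
    df.write.partitionBy("id").json("/tmp/out")

    spark.stop()
  }
}
```

If you later read the data back with spark.read.json("/tmp/out"), Spark reconstructs the id column from the directory names.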