Tags: sorting, apache-spark, pyspark, rdd

How to show the top N results with custom formatting in a Spark RDD?


val sorting = sc.parallelize(List(1,1,1,2,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7,8,8,8,8,8))
sorting.map(x=>(x,1)).reduceByKey((a,b)=>a+b).map(x=>(x._1,"==>",x._2)).sortBy(s=>s._2,false).collect.foreach(println)    
output:
(8,==>,5)
(1,==>,3)
(2,==>,4)
(3,==>,3)
(4,==>,4)
(5,==>,3)
(6,==>,2)
(7,==>,1)

I want to show only the top 3 results and remove the commas from the output.


Solution

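  • Use take(3) instead of collect to get the top 3 results, and build the output string yourself with map so the tuple punctuation disappears. The sort also has to run on the count before formatting: in the original code, s._2 is the "==>" string, so the rows were never ordered by count.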

    sorting.map(x=>(x,1)).reduceByKey((a,b)=>a+b).sortBy(s=>s._2,false).map(x=>s"${x._1} ${x._2}").take(3).foreach(println)
    
    8 5
    2 4
    4 4
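
As a possible alternative when only a few entries are needed, the RDD `top` method can select them without sorting the whole dataset. The sketch below assumes it runs in spark-shell, where `sc` is predefined. The names `counts`, `byCountThenKey`, and `top3` are illustrative, and the tie-breaking rule (smaller key first on equal counts) is an assumption added here to make the order deterministic.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is already defined.
val sorting = sc.parallelize(List(1,1,1,2,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7,8,8,8,8,8))

// Count occurrences of each value: (value, count)
val counts = sorting.map(x => (x, 1)).reduceByKey(_ + _)

// Order by count; on equal counts the smaller key ranks higher (assumed tie-break).
val byCountThenKey = Ordering.by[(Int, Int), (Int, Int)] { case (k, c) => (c, -k) }

// top(n) returns the n largest elements under the given ordering, largest first.
val top3 = counts.top(3)(byCountThenKey)

top3.foreach { case (k, c) => println(s"$k $c") }
// 8 5
// 2 4
// 4 4
```

Because `top` returns its result as an array on the driver, it is best suited to a small N.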