scala · apache-spark · apache-spark-sql · rdd

How to get count of year using spark scala


I have the following movies data. I need the count of movies in each year, e.g. (2002, 2) and (2004, 1):

Littlefield, John (I)   x House 2002
Houdyshell, Jayne   demon State 2004
Houdyshell, Jayne   mall in Manhattan   2002

val data = sc.textFile("..line to file")
val dataSplit = data.map(line => { val d = line.split("\t"); (d(0), d(1), d(2)) })

When I use dataSplit.take(2).foreach(println), I see that d(0) is the name column, such as "Littlefield, John (I)" (first and last name), d(1) is the movie title, such as "x House", and d(2) is the year. How can I get the count of movies for each year?


Solution

  • Map each line to a (year, 1) pair and sum the counts per key with reduceByKey:

    val dataSplit = data
      .map(line => {var d = line.split("\t"); (d(2), 1)}) // (2002, 1)
      .reduceByKey((a, b) => a + b)
    
    // .collect() gives the result: Array((2004,1), (2002,2))
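If you prefer the DataFrame API, the same count can be expressed with groupBy. This is a sketch that assumes a SparkSession named `spark` is in scope and that the file is tab-separated; the column names `actor`, `title`, and `year` are placeholders chosen for illustration.

```scala
// Read the tab-separated file into a DataFrame and name the columns
// (assumes a SparkSession `spark`; the path is a placeholder).
val df = spark.read
  .option("sep", "\t")
  .csv("..line to file")
  .toDF("actor", "title", "year")

// Count movies per year
df.groupBy("year").count().show()
```

For small result sets on an RDD, `dataSplit.countByKey()` is another option: it returns the per-key counts directly to the driver as a `Map`, avoiding the explicit `reduceByKey` plus `collect`.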