
Scala Spark RDD: joining two tables on the same ID


I have two RDDs, data (ratings) and datamovie (movies), with the following element types:

case class Rating(user_ID: Integer, movie_ID: Integer, rating: Integer, timestamp: String)
case class Movie(movie_ID: Integer, title: String, genre: String)

I join them together in Scala like this:

val m = datamovie.keyBy(_.movie_ID)
val r = data.keyBy(_.movie_ID)
val mr = m.join(r)  

The result is an RDD[(Int, (Movie, Rating))]. How can I print the titles of the movies that have a rating of 5, for example? I am not quite sure how to work with the new RDD created by the join.


Solution

  • Convert them to Spark DataFrames and perform the join there. Is there a specific reason you want to keep them as RDDs?

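    import spark.implicits._  // assumes a SparkSession named spark; needed for toDF and $"..."
    val m = datamovie.toDF
    val r = data.toDF
    // An inner join suffices: the rating filter would discard unmatched movies anyway.
    val mr = m.join(r, Seq("movie_ID")).where($"rating" === 5).select($"title")
    mr.show()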
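
If you would rather stay with RDDs, the joined RDD can be handled like any other RDD of tuples: pattern-match on the (key, (Movie, Rating)) shape, filter on the rating, and map to the title. The following is only a minimal sketch under some assumptions: the sample rows, the local SparkSession, and the helper name titlesWithRating are made up for illustration and are not from the question. Because the case classes use Integer, the key type of the join is Integer, and the comparison goes through intValue.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Rating(user_ID: Integer, movie_ID: Integer, rating: Integer, timestamp: String)
case class Movie(movie_ID: Integer, title: String, genre: String)

object TitlesByRating {
  // Hypothetical helper: keep joined pairs whose rating equals `wanted`,
  // project the movie title, and drop duplicate titles.
  def titlesWithRating(mr: RDD[(Integer, (Movie, Rating))], wanted: Int): RDD[String] =
    mr.filter { case (_, (_, rating)) => rating.rating.intValue == wanted }
      .map { case (_, (movie, _)) => movie.title }
      .distinct()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("titles-by-rating")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up sample data standing in for `datamovie` and `data` from the question.
    val datamovie = sc.parallelize(Seq(Movie(1, "Alien", "Sci-Fi"), Movie(2, "Heat", "Crime")))
    val data = sc.parallelize(Seq(Rating(10, 1, 5, "0"), Rating(11, 2, 3, "0"), Rating(12, 1, 5, "0")))

    // Same join as in the question.
    val mr = datamovie.keyBy(_.movie_ID).join(data.keyBy(_.movie_ID))

    // collect() brings the titles to the driver; fine for small results only.
    titlesWithRating(mr, 5).collect().foreach(println)

    spark.stop()
  }
}
```

The distinct() call is there because a movie with several 5-star ratings appears once per rating in the join; drop it if you want one line per rating instead.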