Tags: java, apache-spark, mapreduce, rdd

Find elements in one RDD but not in the other RDD


I have two JavaRDDs of longs, A and B. I want to keep only the longs that are in A but not in B. How should I do that? Thanks!


Solution

  • I am posting a solution in Scala. It should be almost the same in Java.

    Do a leftOuterJoin, which returns every record in the first RDD along with the matching record from the second RDD, if there is one. The joined result looks like WrappedArray((168,(def,None)), (192,(abc,Some(abc)))). To keep only the records that are present in the first RDD alone, filter for the entries whose right-hand side is None.

    // Assumes a SparkSession named spark, as in spark-shell.
    val data = spark.sparkContext.parallelize(Seq((192, "abc"), (168, "def")))
    val data2 = spark.sparkContext.parallelize(Seq((192, "abc")))

    val result = data
      .leftOuterJoin(data2)
      // Keep only the keys that have no match in data2.
      .filter(record => record._2._2.isEmpty)

    println(result.collect.toSeq)
    // Output: WrappedArray((168,(def,None)))
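
Since the question asks about JavaRDD, here is a rough Java sketch of the same leftOuterJoin-then-filter idea. It is not from the original answer: the class name OnlyInFirst, the helper onlyInFirst, the local master setting, and the sample values are all assumptions made for illustration. Because leftOuterJoin is only available on pair RDDs, the sketch first keys each long by itself.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Hypothetical class name, used only for this illustration.
public class OnlyInFirst {

    // Returns the elements of a that have no match in b, using the same
    // leftOuterJoin-then-filter approach as the Scala answer.
    static JavaRDD<Long> onlyInFirst(JavaRDD<Long> a, JavaRDD<Long> b) {
        // leftOuterJoin is defined on pair RDDs, so key each long by itself.
        JavaPairRDD<Long, Long> keyedA = a.mapToPair(x -> new Tuple2<>(x, x));
        JavaPairRDD<Long, Long> keyedB = b.mapToPair(x -> new Tuple2<>(x, x));

        return keyedA
            .leftOuterJoin(keyedB)
            // The right-hand Optional is empty when the key is missing from b.
            .filter(record -> !record._2()._2().isPresent())
            .keys();
    }

    public static void main(String[] args) {
        // Assumption: run locally; adjust the master for a real cluster.
        SparkConf conf = new SparkConf().setAppName("OnlyInFirst").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Long> a = sc.parallelize(Arrays.asList(168L, 192L));
            JavaRDD<Long> b = sc.parallelize(Arrays.asList(192L));
            List<Long> result = onlyInFirst(a, b).collect();
            System.out.println(result); // [168]
        }
    }
}
```

For plain RDDs of longs, it may be simpler to call a.subtract(b), which JavaRDD provides to return the elements of one RDD that are not in another. The join-based version above is mainly useful when the records carry values that you want to keep alongside the key.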