Tags: java, apache-spark, levenshtein-distance, fuzzy-comparison, elasticsearch-hadoop

Apache Spark: JOINing RDDs (data sets) using custom criteria/fuzzy matching


Is it possible to join two (Pair)RDDs (or Datasets/DataFrames) on multiple fields using some "custom criteria"/fuzzy matching, e.g. a range/interval for numbers or dates, or a string distance measure such as Levenshtein?

For "grouping" within an RDD to get a PairRDD, one can implement a PairFunction, but it seems that something similar is not possible when JOINing two RDDs/data sets? I am thinking something like:

rdd1.join(rdd2, myCustomJoinFunction);

I was thinking about implementing the custom logic in hashCode() and equals() but I am not sure how to make "similar" data wind up in the same bucket. I have also been looking into RDD.cogroup() but have not figured out how I could use it to implement this.

I just came across elasticsearch-hadoop. Does anyone know if that library could be used to do something like this?

I am using Apache Spark 2.0.0. I am implementing in Java but an answer in Scala would also be very helpful.

PS. This is my first Stack Overflow question, so bear with me if I have made some newbie mistake :).


Solution

  • For DataFrames/Datasets you can use join with a custom join expression: create a UDF that operates on columns from both DataFrames, just like the first answer to this question does.
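
    For example, a minimal sketch in Java, assuming df1 and df2 are Dataset<Row> instances that both carry (hypothetical) name and amount columns. Spark's built-in levenshtein column function covers this particular distance, so no UDF is needed here, but the same join-expression pattern works with a registered UDF for any other metric:

    import static org.apache.spark.sql.functions.abs;
    import static org.apache.spark.sql.functions.levenshtein;

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Match rows whose names are within Levenshtein distance 2
    // AND whose amounts differ by at most 10.
    Column fuzzyMatch =
        levenshtein(df1.col("name"), df2.col("name")).leq(2)
            .and(abs(df1.col("amount").minus(df2.col("amount"))).leq(10));

    Dataset<Row> joined = df1.join(df2, fuzzyMatch);

    Note that a join condition with no equality component cannot be executed as a hash join, so Spark falls back to a nested-loop style plan; it works, but the cost is comparable to the cartesian approach below.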

    You can also do

    rdd1.cartesian(rdd2).filter(...)
    

    Remember that this is expensive to compute: the Cartesian product of RDDs with n and m elements materialises n × m pairs, all of which the filter has to examine.
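
    Concretely, a minimal sketch of the cartesian-plus-filter approach, assuming rdd1 and rdd2 are JavaRDD<String> and using the Levenshtein implementation from Apache Commons Lang 3 (a Spark dependency); swap in your own record type and distance function as needed:

    import org.apache.commons.lang3.StringUtils;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    // Build every (a, b) pair, then keep only the "close" ones.
    JavaPairRDD<String, String> candidates = rdd1.cartesian(rdd2);
    JavaPairRDD<String, String> matches = candidates.filter(
        pair -> StringUtils.getLevenshteinDistance(pair._1(), pair._2()) <= 2);

    If the data is large, consider blocking first to shrink the candidate set, e.g. cogroup on a coarse key such as a string prefix, and only compare pairs within each block.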