Tags: java, hadoop, mapreduce, cascading

How to use a similarity algorithm for the keys of a Hadoop MapReduce job


I need to implement the following problem. I'm getting data of this type:

public class Data {
  private String key;
  private String valueData;
}

I need to write a MapReduce job that outputs all unique keys, with one (arbitrary) valueData for each of them. That sounds very simple for Hadoop, and yes, I know how to implement it.
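For reference, the unique-key part can be sketched without a cluster. The helper below simulates the shuffle and reduce steps in plain Java (class and method names are illustrative, not Hadoop API):

```java
import java.util.*;

// Plain-Java sketch of the "unique keys, one valueData each" job:
// the shuffle groups records by key, and the reducer emits the first
// value it sees for each key (an arbitrary representative).
public class UniqueKeyReduce {

    // Mimics the map + shuffle phase: group every valueData under its key.
    static Map<String, List<String>> shuffle(List<String[]> records) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] rec : records) {
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>()).add(rec[1]);
        }
        return grouped;
    }

    // Mimics the reduce phase: keep one arbitrary value per key.
    static Map<String, String> reduce(Map<String, List<String>> grouped) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().get(0));
        }
        return out;
    }
}
```

In real Hadoop, `shuffle` corresponds to emitting `(key, valueData)` pairs from the Mapper, and `reduce` to a Reducer that writes only the first value from its `Iterable`.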

But the real problem is that I also need to reduce all "similar" keys, and the output should be one of the similar keys together with one of their valueData values.

What is the best way (and how) to implement this in Hadoop? I would also like the flexibility to change the similarity algorithm.


Solution

  • Have a look at the MinHashing technique, it is widely used with MapReduce for this task.

    The similarity metric it approximates is Jaccard similarity; I'm not sure whether there are variants for other metrics. However, once you have found the near keys, you can use any other metric to measure the similarity between them, because MinHashing has already drastically reduced your search space.

    You can read more on wikipedia: http://en.wikipedia.org/wiki/MinHash

    Mahout has a MinHash clustering algorithm; you can have a look there. It is pretty easy to understand and features several hashing algorithms.
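To make the idea concrete, here is a minimal, self-contained MinHash sketch in plain Java (not the Mahout implementation; the shingle size, hash family, and class names are illustrative choices). Keys are broken into character 3-grams, and two keys agree on a signature component with probability equal to the Jaccard similarity of their shingle sets, so the signature (or bands of it) can serve as the grouping key in your MapReduce job:

```java
import java.util.*;

// Illustrative MinHash sketch: similar keys produce similar signatures,
// so signature bands can be emitted as reduce keys to group near-duplicates.
public class MinHashSketch {
    private static final int PRIME = 2147483647; // Mersenne prime 2^31 - 1
    private final int numHashes;
    private final int[] seedsA, seedsB;

    public MinHashSketch(int numHashes, long seed) {
        this.numHashes = numHashes;
        Random rnd = new Random(seed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seedsA[i] = rnd.nextInt(PRIME - 1) + 1; // a != 0
            seedsB[i] = rnd.nextInt(PRIME);
        }
    }

    // Represent a key as its set of character 3-grams (shingles).
    static Set<String> shingles(String key) {
        Set<String> s = new HashSet<>();
        for (int i = 0; i + 3 <= key.length(); i++) s.add(key.substring(i, i + 3));
        return s;
    }

    // Signature component i is the minimum of hash_i over all shingles.
    public int[] signature(String key) {
        int[] sig = new int[numHashes];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String sh : shingles(key)) {
            int h = sh.hashCode() & 0x7fffffff;
            for (int i = 0; i < numHashes; i++) {
                long v = ((long) seedsA[i] * h + seedsB[i]) % PRIME;
                if (v < sig[i]) sig[i] = (int) v;
            }
        }
        return sig;
    }

    // The fraction of matching components estimates Jaccard similarity.
    public double estimateSimilarity(String a, String b) {
        int[] sa = signature(a), sb = signature(b);
        int eq = 0;
        for (int i = 0; i < numHashes; i++) if (sa[i] == sb[i]) eq++;
        return (double) eq / numHashes;
    }
}
```

Swapping the similarity algorithm then amounts to swapping how the grouping key is derived from the record key (a different shingling, hash family, or an entirely different locality-sensitive hashing scheme), while the rest of the job stays unchanged.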