
Comparing Shingles for Near-Duplicate Detection


I'm working on shingling code to compare near-duplicates. I'm getting a little stuck on the compare code. This is my rough attempt so far.

//shingles are already hashed integers and I'm working on the evaluation to true via the float similar parameter.
public static boolean compareShingles(float similar, CompareObject comp1, CompareObject comp2) {
    int intersections = 0;
    // Loop over the smaller list so we never index past the end of the other one.
    int smaller = Math.min(comp1.getShingle().size(), comp2.getShingle().size());
    for (int i = 0; i < smaller; i++) {
        if (comp1.getShingle().get(i).equals(comp2.getShingle().get(i))) {
            intersections++;
        }
    }
    return true; // not functional; still working out when to return true
}

I'm a little stuck on whether I should compare these shingles 1-to-1 by position in an array, or compare each shingle against all of the shingles in the other document in a loop.

For example, if I looped, comparing every shingle to every other shingle, then these documents would be considered identical:

{blah blah blah, Once upon a, time blah blah}
{Once upon a, time blah blah, blah blah blah}

If I did a positional compare on the same docs, then position 1 would compare "blah blah blah" to "Once upon a", and that would return false.

I'm thinking looping would be more processing-intensive, but it might be the correct option. Thoughts?


Solution

  • The order doesn't matter.

    You basically build the shingle sets and compare them with Jaccard similarity. It helps to use a hash set, which automatically throws away duplicate shingles. Count the matches between the two documents and decide how large the overlap needs to be to consider them similar.

    http://ethen8181.github.io/machine-learning/clustering_old/text_similarity/text_similarity.html
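    To make the set-based approach concrete, here is a minimal sketch of the comparison as a standalone method. It assumes the shingles are hashed integers, as in the question, and for simplicity takes the shingle lists directly instead of `CompareObject`; the threshold `similar` is the minimum Jaccard similarity (intersection size over union size) to count two documents as near-duplicates.

    ```java
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ShingleCompare {

        // Returns true when the Jaccard similarity of the two shingle
        // lists meets or exceeds the 'similar' threshold (e.g. 0.8f).
        public static boolean compareShingles(float similar,
                                              List<Integer> shingles1,
                                              List<Integer> shingles2) {
            Set<Integer> set1 = new HashSet<>(shingles1); // sets drop duplicate shingles
            Set<Integer> set2 = new HashSet<>(shingles2);

            Set<Integer> intersection = new HashSet<>(set1);
            intersection.retainAll(set2);                 // shingles present in both docs

            Set<Integer> union = new HashSet<>(set1);
            union.addAll(set2);                           // shingles present in either doc

            if (union.isEmpty()) {
                return true;                              // two empty docs are identical
            }
            float jaccard = (float) intersection.size() / union.size();
            return jaccard >= similar;
        }

        public static void main(String[] args) {
            // The two example documents from the question share all three
            // shingles, just in a different order, so they compare as equal.
            List<Integer> doc1 = List.of("blah blah blah".hashCode(),
                                         "Once upon a".hashCode(),
                                         "time blah blah".hashCode());
            List<Integer> doc2 = List.of("Once upon a".hashCode(),
                                         "time blah blah".hashCode(),
                                         "blah blah blah".hashCode());
            System.out.println(compareShingles(0.8f, doc1, doc2)); // prints true
        }
    }
    ```

    Because the comparison goes through sets, the positional problem from the question disappears: reordered shingles produce the same sets, the same intersection, and the same result.
    
    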