Search code examples
javaduplicateslevenshtein-distanceconcurrentmodification

java - Remove nearly duplicates from a List


I have a List of Tweet objects (homegrown class) and I want to remove NEARLY duplicates based on their text, using the Levenshtein distance. I have already removed the identical duplicates by hashing the tweets' texts but now I want to remove texts that are identical but have up to 2-3 different characters. Since this is a O(n^2) approach, I have to check every single tweet text with all the others available. Here's my code so far:

int distance;
for(Tweet tweet : this.tweets) {
     distance = 0;
     Iterator<Tweet> iter = this.tweets.iterator();
     while(iter.hasNext()) {
         Tweet currentTweet = iter.next();
         distance = Levenshtein.distance(tweet.getText(), currentTweet.getText());
         if(distance < 3 && (tweet.getID() != currentTweet.getID())) {
             iter.remove();
         }
     }
}

The first problem is that the code throws ConcurrentModificationException at some point and never completes. The second one: can I do anything better than this double loop? The list of tweets contains nearly 400.000 tweets so we're talking about 160 billion iterations!


Solution

  • As for the ConcurrentModificationException: As the others pointed out, I was removing elements from a list that I was also iterating in the outer for-each. Changing the for-each into a normal for resolved the problem.

    As for the O(n^2) approach: There's no "better" algorithm regarding its complexity, than a O(n^2) approach. What I'm trying to do is an "all-to-all" comparison to find nearly duplicate elements. Of course there are optimizations to lower the total capacity of n, parallelization to concurrently parse sub-lists of the original list, but the complexity is quadratic at all times.