I have a List of Tweet objects (homegrown class) and I want to remove NEARLY duplicates based on their text, using the Levenshtein distance. I have already removed the identical duplicates by hashing the tweets' texts but now I want to remove texts that are identical but have up to 2-3 different characters. Since this is a O(n^2) approach, I have to check every single tweet text with all the others available. Here's my code so far:
int distance;
for(Tweet tweet : this.tweets) {
distance = 0;
Iterator<Tweet> iter = this.tweets.iterator();
while(iter.hasNext()) {
Tweet currentTweet = iter.next();
distance = Levenshtein.distance(tweet.getText(), currentTweet.getText());
if(distance < 3 && (tweet.getID() != currentTweet.getID())) {
iter.remove();
}
}
}
The first problem is that the code throws ConcurrentModificationException at some point and never completes. The second one: can I do anything better than this double loop? The list of tweets contains nearly 400.000 tweets so we're talking about 160 billion iterations!
As for the ConcurrentModificationException: As the others pointed out, I was removing elements from a list that I was also iterating in the outer for-each. Changing the for-each into a normal for resolved the problem.
As for the O(n^2)
approach: There's no "better" algorithm regarding its complexity, than a O(n^2)
approach. What I'm trying to do is an "all-to-all" comparison to find nearly duplicate elements. Of course there are optimizations to lower the total capacity of n
, parallelization to concurrently parse sub-lists of the original list, but the complexity is quadratic at all times.