I'm writing a piece of Java software that has to make the final judgement on the similarity of two documents encoded in UTF-8.
The two documents are very likely to be the same, or only slightly different from each other, because they have many features in common, such as date, location, and creator, but their text is what decides whether they really are the same document.
I expect the texts of the two documents to be either very similar or not similar at all, so I can set a rather strict similarity threshold. For example, I could say that the two documents are similar only if they have 90% of their words in common, but I would like something more robust that works for short and long texts alike.
To sum it up I have:
I've experimented with SimMetrics, which has a large array of string-matching functions, but I'm most interested in suggestions about possible algorithms to use.
Possible candidates I have are:
Also, considering two texts similar only when they are exactly the same would not work well, because I'd like documents that differ by only a few words to pass the similarity test.
Levenshtein distance is the standard measure for a reason: it's easy to compute and easy to grasp the meaning of. If you are wary of the number of characters in a long document, you can just compute it on words or sentences or even paragraphs instead of characters. Since you expect the similar pairs to be very similar, that should still work well.
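To make the word-level variant concrete, here is a minimal sketch in plain Java, not tied to SimMetrics or any other library. The class name `WordLevenshtein`, the whitespace tokenizer, the lower-casing, and the normalization by the longer document's word count are all my own assumptions for illustration; adjust them to your data.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: Levenshtein distance computed over words instead of characters.
final class WordLevenshtein {

    // Assumption: words are separated by whitespace and case does not matter.
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Classic dynamic-programming edit distance, keeping only two rows in memory.
    static int distance(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j; // turning an empty prefix into j words costs j insertions
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i; // turning i words into an empty prefix costs i deletions
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // match or substitution
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Assumption: normalize by the longer text so the score lies in [0, 1].
    static double similarity(String x, String y) {
        List<String> a = tokenize(x);
        List<String> b = tokenize(y);
        int longest = Math.max(a.size(), b.size());
        if (longest == 0) {
            return 1.0; // two empty texts are treated as identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    static boolean isSimilar(String x, String y, double threshold) {
        return similarity(x, y) >= threshold;
    }
}
```

A call such as `WordLevenshtein.isSimilar(textA, textB, 0.9)` would then play the role of your strict threshold. Because the score is a ratio of word edits to document length, the same threshold applies to short and long texts, although in a very short text a single changed word moves the score a lot. Time is proportional to the product of the two word counts, and memory to the word count of the second text.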