java string-comparison

Similarity String Comparison in Java


I want to compare several strings with each other and find the ones that are most similar. Is there a library, method, or best practice that would tell me which strings are more similar to a given string? For example:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"

Here the comparison should report that the first pair is more similar than the second.

I guess I need some method such as:

double similarityIndex(String s1, String s2)

Is there such a thing somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file with the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are entered. I want a semi-automated way to find which MS Project entries are similar to entries in the legacy system so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.


Solution

  • Yes, there are many well-documented algorithms, such as:

    • Cosine similarity
    • Jaccard similarity
    • Dice's coefficient
    • Matching similarity
    • Overlap similarity
    • etc.
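
    To make two of the set-based measures above concrete, here is a minimal sketch of Jaccard similarity and Dice's coefficient computed over word tokens. The class name `StringSimilarity`, the `tokens` helper, the lower-casing, and the choice to split on non-alphanumeric characters are all assumptions made for illustration; they do not come from any particular library, and character n-grams may suit abbreviated descriptions better than whole words.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, written for illustration only.
final class StringSimilarity {

    // Assumed tokenization: lower-case, then split on runs of
    // characters that are neither letters nor digits.
    static Set<String> tokens(String s) {
        Set<String> result = new HashSet<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Number of tokens that occur in both sets.
    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range [0, 1].
    static double jaccard(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen here: two empty inputs are identical
        }
        int common = intersectionSize(a, b);
        int union = a.size() + b.size() - common;
        return (double) common / union;
    }

    // Dice's coefficient: 2 * |A ∩ B| / (|A| + |B|), in the range [0, 1].
    static double dice(String s1, String s2) {
        Set<String> a = tokens(s1);
        Set<String> b = tokens(s2);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // same convention as above
        }
        return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        String base = "The quick fox jumped";
        System.out.println(jaccard(base, "The fox jumped"));
        System.out.println(jaccard(base, "The fox"));
        System.out.println(dice(base, "The fox jumped"));
        System.out.println(dice(base, "The fox"));
    }
}
```

    With the strings from the question, the Jaccard score is 0.75 for "The fox jumped" (3 shared words out of 4 distinct words) and 0.5 for "The fox" (2 out of 4), so the first pair ranks as more similar, which matches the expected ordering. Either method could stand in for the `similarityIndex(String s1, String s2)` signature asked about.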

    A good summary ("Sam's String Metrics") can be found here (the original link is dead, so this points to the Internet Archive copy)

    Also check these projects: