Search code examples
javastring-matchinglevenshtein-distancesimilarity

what is a good metric for deciding if 2 Strings are "similar enough"


I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.

What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".

However, this is totally arbitrary and I don't think is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?

Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assign it so that I can safely say that 2 Strings are "similar enough"?

UPDATE

I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.


So far, my algorithm (pseudo code) is roughly along the lines of:

/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 */
calculate similarity (List<String> list1, List<String> list2) {

    length1 = 0;
    length2 = 0;
    levenshteinDistance = 0;

    iterator1 = list1.iterator();
    iterator2 = list2.iterator();

    while ( iterator1.hasNext() && iterator2.hasNext() ) {

        // skip blank/empty lines because they are not interesting
        str1 = iterator1.next();    length1 += str1.length();
        str2 = iterator2.next();    length2 += str2.length();

        levensteinDistance += getLevenshteinDistance(str1, str2);
    }

    // handle the rest of the lines from the iterator that has not terminated

    difference = levenshteinDistance / Math.max(length1, length2);

    return (difference < 0.25) ? true : false; // <- arbitrary threshold, yuck!
}

Solution

  • How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:

    Take all the letters from both Strings an build a table like this:

    Letter | String1 | String2
    

    This can be a simple hash table or whatever.

    In the letter column put each letter and in the string columns put their frequency inside that string (if a letter does not appear in a string the value is 0).

    It is called cosine similarity because you interpret each of the two string columns as vectors, where each component is the number associated to a letter. Next, compute the cosine of the "angle" between the vectors as:

    C = (V1 * V2) / (|V1| * |V2|)
    

    The numerator is the dot product, that is the sum of the products of the corresponding components, and the denominator is the product of the sizes of the vectors.

    How close C is to 1 gives you how similar the Strings are.

    It may seem complicated but it's just a few lines of code once you understand the idea.

    Let's see an example: consider the strings

    s1 = aabccdd
    s2 = ababcd
    

    The table looks like:

    Letter a b c d
    s1     2 1 2 2
    s2     2 2 1 1
    

    And thus:

    C = (V1 * V2) / (|V1| * |V2|) = 
    (2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877
    

    So they are "pretty" similar.