Search code examples
stringalgorithmstring-comparisonsimilaritysentence-similarity

Is this already a string similarity algorithm?


I'm unfamiliar with string similarity algorithms except for Levenshtein Distance because that's what I'm using and it has turned out to be less than ideal.

So I've kind of got an idea of a recursive algorithm I'd like to implement but I want to know if it exists already so I can leverage other's expertise.

Here's the algorithm by example:

string 1: "Paul Johnson"

string 2: "John Paulson"

Step 1: find all longest matches

Match 1: "Paul"

Match 2: "John"

Match 3: "son"

Match 4: " "

Step 2: Calculate scores for each match with this formula: ((match.len/string.len)*match.len) This allows longer strings to be weighted more at a balanced rate based on the length of the string.

Match 1: (4/12)*4 = 1.333...

Match 2: 1.333...

Match 3: .75

Match 4: .083

Step 3: do steps 1 and 2 on larger scales, (matches of matches.) This I don't have figured out exactly. but my thinking is if "son" comes after "Paul John" and it comes after "John Paul" then that should count for something.

Step 4: sum all the scores that have been calculated.

Scores: 1.333 + 1.333 + .75 + .083333 = 3.4999... (plus whatever scores step 3 produces)

Does this look familiar to anyone? I hope someone else has gone to the trouble of actually making an algorithm along these lines so I don't have to figure it out myself.


Solution

  • What you describe somewhat resembles what the following paper calls the Longest Common Substring (LCS). For a brief description and comparison to other algorithms: A Comparison of Personal Name Matching

    This algorithm [11] repeatedly finds and removes the longest common sub-string in the two strings compared, up to a minimum lengths (normally set to 2 or 3).

    ...
    A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman).

    ...

    this algorithm is suitable for compound names that have words (like given- and surname) swapped.