I've been using the [Simmetrics][1] Java library with good success for comparing two Strings with good success. But there seem to be two approaches and I need a combination of both for my scenario.
Currently I am using CosineSimilarity (I do use some simplifiers but have omitted here to keep code simple)
StringMetric metric = with(new CosineSimilarity<String>())
.tokenize(Tokenizers.whitespace()).build();
score = metric.compare(string1, string2);
and this works quite well except I when there is a simple misspelling I would have expected a higher score than I get
e.g comparing mony honey and money honey only returns 0.5 (scores go from 0.0 to 1.0 with 1.0 being perfect match), I would have expected higher.
With Levenshtein it returns a better 0.9090909
But one thing I noted reading the documentation was that this is a MultiSet metric, and that the whitespace() is actually required to break the input into parts, whereas a StringMetric such as Levenshtein does not
StringMetric metric = with(new Levenshtein())
.build();
This then implies do me that Levenshtein doesnt consider whitespace specially which is an issue as I want to match words and essentially ignore the whitespace or order.
so for example using CosineSimilarity it returns 1.0 when comparing honey trap and trap honey but Levenshtein return 0.0, that is no good for me.
What I ideally want is word order to not be important, and then for individual words to be a good match if there are just slight variations in the word e.g money/mony
The Strings can be in any language, but are most usually in English, they are song titles so are usually less than ten words long, typically about 5 words long.
Does Simmetrics offer another algorithm that can provide both these parts ?
There are simplifiers such as RefinedSoundex that could be applied to input, but because the language may not be in English dont think that would work very well.
What do you think would be the best algorithm to use ?
Simmetrics contains metrics for comparing Strings, Lists, Sets and Multisets.
The Levenshtein distance between two words is the minimum number of single-character edits. Whitespace is a character too so a difference in whitespace causes a difference in similarity.
Cosine similarity is the similarity between two zero vectors (which for convenience are presented as multisets). So without some form of processing cosine similarity is simply not suited to compare strings.
Depending on how you split the string you may end up comparing entirely different things. If you split the string on whitespace you will end up comparing documents by their similarity in word usage. If you split the string on n-grams you'll compare strings on their letter pairs which tends to work well against typos.
For your particular use case you may want to look into tokenizing on whitespace and then tokenizing on q-grams. Then try either CosineSimilarity,Tanimoto, Dice, SimonWhite, Jaccard.
E.g:
/**
* Tokenizers can also be chained.
*
* `chilperic ii son of childeric ii`
*
* By splitting on whitespace is tokenized into:
*
* `[chilperic, ii, son, of, childeric, ii]`
*
* After using a q-gram with a q of 2:
*
* `[ch,hi,il,il,lp,pe,er,ri,ic, ii, so,on, of, ch,hi,il,ld,de,er,ri,ic,
* ii]`
*
*/
public static float example04() {
String a = "A quirky thing it is. This is a sentence.";
String b = "This sentence is similar; a quirky thing it is.";
StringMetric metric =
with(new CosineSimilarity<String>())
.tokenize(Tokenizers.whitespace())
.tokenize(Tokenizers.qGram(3))
.build();
return metric.compare(a, b); // 0.8292
}
To make your decision you can take a number of representative queries and compare the results by their precision and recall. Then you can make a good decision on which metric to use.
Full disclosure: I'm the current maintainer of Simmetrics Project.