Tags: java, overlap, similarity, jaro-winkler

How to compute Overlap Coefficient and Jaro-Winkler using Simmetrics in Java


I have been trying to use the Simmetrics library, pulled in with this Maven dependency:

    <dependency>
        <groupId>com.github.mpkorstanje</groupId>
        <artifactId>simmetrics-core</artifactId>
        <version>4.1.0</version>
    </dependency>

So far I am computing Jaro-Winkler using:

    StringMetric sm = StringMetrics.jaroWinkler();
    float res = sm.compare("Harry Potter", "Potter Harry");
    System.out.println(res);

0.43055558

and the Overlap Coefficient by:

    sm = StringMetrics.overlapCoefficient();
    res = sm.compare("The quick brown fox", "The slow brawn fur");
    System.out.println(res);

0.25

but according to https://asecuritysite.com/forensics/simstring

The Jaro-Winkler score should be 0 for this, and the overlap coefficient should be 100. Is this even the correct way to use this library? What are the proper calls if I want to run both of these metrics to match movies from one list against another list that I got from IMDB? I intend to compare the titles from both sets and take the average of the two scores, then do the same for the cast of each movie. Thanks


Solution

  • You are using the library correctly. You may, however, wish to customize the metric you are using. Filtering out short, common words such as 'the', 'a', 'and', etc., and using a q-gram tokenizer might be more effective than using the default metrics from StringMetrics, most of which tokenize on whitespace and none of which apply filters or simplifiers.

    Beyond that, I can't really tell you which combination of metrics, tokenizers, filters and simplifiers will work for your use case. What works best is rather domain-specific, so you'll have to try a few combinations and compare the results.
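
    As a rough illustration of what filtering and q-gram tokenization do to a score, here is a minimal sketch in plain Java. It deliberately avoids Simmetrics calls so that nothing in it is mistaken for the library's API: the class name QGramOverlapSketch, the stop-word list and the choice of q = 3 are all assumptions made for this example, and the library's own tokenizers and filters may behave differently (for instance around padding).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class QGramOverlapSketch {

    // Hypothetical stop-word list; choose one that suits your own data.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    /** Lowercases, splits on whitespace, drops stop words, then collects the q-grams of each remaining word. */
    static Set<String> tokens(String text, int q) {
        Set<String> grams = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ENGLISH).trim().split("\\s+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            if (word.length() <= q) {
                grams.add(word); // words no longer than q are kept whole
                continue;
            }
            for (int i = 0; i + q <= word.length(); i++) {
                grams.add(word.substring(i, i + q));
            }
        }
        return grams;
    }

    /** Overlap coefficient: |A ∩ B| / min(|A|, |B|). */
    static double overlap(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / Math.min(a.size(), b.size());
    }

    public static void main(String[] args) {
        // Same words in a different order, plus a stop word: prints 1.0
        System.out.println(overlap(tokens("Harry Potter", 3), tokens("The Potter Harry", 3)));
        // Once "the" is filtered out, no trigram is shared: prints 0.0
        System.out.println(overlap(tokens("The quick brown fox", 3), tokens("The slow brawn fur", 3)));
    }
}
```

    With 'the' filtered out and trigrams as tokens, word order no longer matters, while "brown" and "brawn" still share no trigram. A smaller q (for example 2) would make the score more tolerant of such near-misses.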


    When I use the website you provided to calculate the Cosine Similarity and Overlap Coefficient of "The quick brown fox" and "The slow brawn fur", I get:

    String 1: The quick brown fox
    String 2: The slow brawn fur
    
    The results are then:
    Cosine Similarity   25
    Overlap Coefficient 25
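
    The 25 (that is, 0.25 on a 0 to 1 scale) can be reproduced by hand from the whitespace tokens: each string has four words and only "The" is shared. The sketch below uses the textbook set-based definitions in plain Java; the class name WhitespaceSetMetrics is made up for this example, and neither the website nor Simmetrics is guaranteed to compute the metrics in exactly this way. It assumes non-empty input strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WhitespaceSetMetrics {

    /** Splits on whitespace and collects the distinct words (case-sensitive). */
    static Set<String> tokens(String s) {
        return new HashSet<>(Arrays.asList(s.trim().split("\\s+")));
    }

    /** Number of tokens that occur in both sets. */
    static int shared(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Overlap coefficient: |A ∩ B| / min(|A|, |B|). */
    static double overlap(String x, String y) {
        Set<String> a = tokens(x);
        Set<String> b = tokens(y);
        return (double) shared(a, b) / Math.min(a.size(), b.size());
    }

    /** Set-based cosine similarity: |A ∩ B| / sqrt(|A| * |B|). */
    static double cosine(String x, String y) {
        Set<String> a = tokens(x);
        Set<String> b = tokens(y);
        return shared(a, b) / Math.sqrt((double) a.size() * b.size());
    }

    public static void main(String[] args) {
        String s1 = "The quick brown fox";
        String s2 = "The slow brawn fur";
        System.out.println(overlap(s1, s2)); // 1 / min(4, 4)   = 0.25
        System.out.println(cosine(s1, s2));  // 1 / sqrt(4 * 4) = 0.25
    }
}
```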
    

    When I use Simmetrics, I get the same values, reported on a 0 to 1 scale instead of 0 to 100:

    System.out.println(
      StringMetrics.overlapCoefficient().compare(
        "The quick brown fox", "The slow brawn fur")); // 0.25
    System.out.println(
      StringMetrics.cosineSimilarity().compare(
         "The quick brown fox", "The slow brawn fur")); // 0.25
    

    Regarding Jaro-Winkler, it looks like the website is using an older version of Simmetrics. The specific combination of metrics and names, in particular Chapman Length Deviation, which was written by Sam Chapman, the original author of Simmetrics, leaves little doubt about it.

    The older versions had some peculiarities, though I can't point to the specific one that causes this difference without debugging the two versions side by side again.