
What is the third parameter to Text::JaroWinkler::strcmp95 for?


I am interested in using the Jaro-Winkler module for Perl to compute the distance (or similarity) between two strings:

http://search.cpan.org/~scw/Text-JaroWinkler-0.1/JaroWinkler.pm

The syntax of the function is not clear to me; I could not find any clear documentation of it.

Here is the sample code:

#!/usr/bin/perl

use 5.10.0;
use Text::JaroWinkler qw( strcmp95 );
print strcmp95("it is a dog","i am a dog.",11);

What exactly does the 11 represent? I gather it is a length, but which one? Is it the number of characters I want compared? Is the argument required?


Solution

  • See the source for an answer to your question. It contains this line:

    $ying = sprintf("%*.*s", -$y_length, $y_length, $ying);
    

    So $y_length is used to normalize the strings before comparison: each one is padded with trailing spaces if it is shorter than $y_length and truncated if it is longer, so both end up the same length. These equal-length strings are then fed into the actual comparison function. This suggests that Alex is correct: passing max(length $ying, length $yang) should give the best results under most circumstances.

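    Reading the source also reveals that no default is applied if you omit $y_length. Both strings are then trimmed to zero length, so you end up comparing the empty string to the empty string. Those should have a pretty short JW distance, but the result says nothing about your inputs, so in practice the argument is required.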
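
The padding and trimming described above can be tried without the module. The following is a minimal sketch, not the module's own code: pad_or_trim is a hypothetical helper that only wraps the sprintf call quoted from the source, and List::Util's max is used to compute the length suggested above. It assumes the same formatting is applied to both strings.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Hypothetical helper (not part of Text::JaroWinkler) that mirrors the
# quoted sprintf call: the negative width left-justifies and pads with
# spaces, and the precision cuts the string at $len characters.
sub pad_or_trim {
    my ($str, $len) = @_;
    return sprintf("%*.*s", -$len, $len, $str);
}

my ($ying, $yang) = ("it is a dog", "i am a dog.");

# The length suggested in the answer: the longer of the two strings.
my $y_length = max(length $ying, length $yang);

# Brackets make trailing spaces and empty results visible.
printf "length passed: %d\n", $y_length;
printf "[%s]\n", pad_or_trim($ying, $y_length);  # unchanged
printf "[%s]\n", pad_or_trim("dog", 5);          # padded on the right
printf "[%s]\n", pad_or_trim($ying, 4);          # truncated
printf "[%s]\n", pad_or_trim($ying, 0);          # zero length: empty string
```

With the longer length, neither string loses characters and the shorter one only gains trailing spaces. A smaller value silently drops the tail of both strings before they are compared.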