Search code examples
perlalgorithmstringsimilarity

How do I determine the longest similar portion of several strings?


As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.

Example:

  • file:///home/gms8994/Music/t.A.T.u./
  • file:///home/gms8994/Music/nina%20sky/
  • file:///home/gms8994/Music/A%20Perfect%20Circle/

Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's common for all 3 strings.

Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.

From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.


Solution

  • Edit: I'm sorry for mistake. My pity that I overseen that using my variable inside countit(x, q{}) is big mistake. This string is evaluated inside Benchmark module and @str was empty there. This solution is not as fast as I presented. See correction below. I'm sorry again.

    Perl can be fast:

    use strict;
    use warnings;
    
    package LCP;
    
    sub LCP {
        return '' unless @_;
        return $_[0] if @_ == 1;
        my $i          = 0;
        my $first      = shift;
        my $min_length = length($first);
        foreach (@_) {
            $min_length = length($_) if length($_) < $min_length;
        }
    INDEX: foreach my $ch ( split //, $first ) {
            last INDEX unless $i < $min_length;
            foreach my $string (@_) {
                last INDEX if substr($string, $i, 1) ne $ch;
            }
        }
        continue { $i++ }
        return substr $first, 0, $i;
    }
    
    # Roy's implementation
    sub LCP2 {
        return '' unless @_;
        my $prefix = shift;
        for (@_) {
            chop $prefix while (! /^\Q$prefix\E/);
            }
        return $prefix;
    }
    
    1;
    

    Test suite:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    Test::LCP->runtests;
    
    package Test::LCP;
    
    use base 'Test::Class';
    use Test::More;
    use Benchmark qw(:all :hireswallclock);
    
    sub test_use : Test(startup => 1) {
        use_ok('LCP');
    }
    
    sub test_lcp : Test(6) {
        is( LCP::LCP(),      '',    'Without parameters' );
        is( LCP::LCP('abc'), 'abc', 'One parameter' );
        is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
        is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
            'abcd', 'Some common prefix' );
        my @str = map { chomp; $_ } <DATA>;
        is( LCP::LCP(@str),
            'file:///home/gms8994/Music/', 'Test data prefix' );
        is( LCP::LCP2(@str),
            'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
        my $t = countit( 1, sub{LCP::LCP(@str)} );
        diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
        $t = countit( 1, sub{LCP::LCP2(@str)} );
        diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
    }
    
    __DATA__
    file:///home/gms8994/Music/t.A.T.u./
    file:///home/gms8994/Music/nina%20sky/
    file:///home/gms8994/Music/A%20Perfect%20Circle/
    

    Test suite result:

    1..7
    ok 1 - use LCP;
    ok 2 - Without parameters
    ok 3 - One parameter
    ok 4 - None of common prefix
    ok 5 - Some common prefix
    ok 6 - Test data prefix
    ok 7 - Test data prefix by LCP2
    # LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 20766.06/s (n=22635)
    # LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 16746.73/s (n=17919)
    

    That means that pure Perl solution using substr is about 20% faster than Roy's solution at your test case and one prefix finding takes about 50us. There is not necessary using XS unless your data or performance expectations are bigger.