Search code examples
perlperl-modulelevenshtein-distanceedit-distance

Perl module to check whether a string was created by inserting text into another string


Problem

I have two strings $base and $succ and want to check whether $succ can be created from $base by inserting arbitrary strings at arbitrary positions.

Some examples, written as isSucc($base, $succ):

  • isSucc("abc", "abcX") is true, since we can insert X at the end of abc.
  • isSucc("abc", "XabYcZ") is true, since we can insert XabYcZ.
  • isSucc("abc", "abX") is false, since c was deleted.
  • isSucc("abc", "cab") is false, since c (the one at the end of abc) was deleted.
  • isSucc("abc", "cabc") is true, since we can insert c at the beginning of abc.
  • isSucc("", "anything") is true, since we just have to insert "anything".

We can assume, that $base and $succ do not contain any (vertical) withespace. Usually we have length($base) < length($succ) < 1000. Good performance is not critical but would be nice to have.

I already know one way to implement isSucc*.

Question

  • Is there a perl-module that defines something similar to isSucc?
  • Does someone has an easier/faster/alternative implementation* of isSucc?

* My Approach

Compute the edit/Levenshtein distance with a custom cost model (cost(Insertion)=0, cost(Deletion)=cost(substitution)=1). Then check whether the edit distance is 0.

Comparing the Answers

I wanted to compare the three solutions loop, greedy matching, and non-greedy matching, but the matching methods often took more than 100 times as long as the loop solution so I aborted the tests. Nevertheless – or perhaps exactly for that reason – we have a clear winner: The loop solution.
Big thanks to Christoffer Hammarström.


Solution

  • sub is_subsequence {
        my ($needles, $haystack) = @_;
        my $found = 0;
        for my $needle (split '', $needles) {               # for each character $needle in $needles
            $found = 1 + index $haystack, $needle, $found;  # find it after the previous one in $haystack
            return 0 unless $found;                         # return false if we can't
        }
        return 1;                                           # return true if we found all $needles in $haystack
    }
    

    use Test::More tests => 6;              # 1..6
    is 1, is_subsequence("abc", "abcX");    # ok 1
    is 1, is_subsequence("abc", "XabYcZ");  # ok 2
    is 0, is_subsequence("abc", "abX");     # ok 3
    is 0, is_subsequence("abc", "cab");     # ok 4
    is 1, is_subsequence("abc", "cabc");    # ok 5
    is 1, is_subsequence("", "anything");   # ok 6