Search code examples
pythonalgorithmperlfuzzy-search

Edit distance: Ignore start/end


I am looking for an algorithm that does edit distance, but which will ignore start+end in the one string and white space:

edit("four","foor") = 1
edit("four","noise fo or blur") = 1

Is there an existing algorithm for that? Maybe even a Perl or a Python Library?


Solution

  • The code to do this is simple in concept. It's your idea of what you'd like to ignore that you can add on your own:

    #!perl
    use v5.22;
    use feature qw(signatures);
    no warnings qw(experimental::signatures);
    
    use Text::Levenshtein qw(distance);
    
    say edit( "four", "foor" );
    say edit( "four", "noise fo or blur" );
    
    sub edit ( $start, $target ) {
        # transform strings to ignore what you want
        # ...
        distance( $start, $target )
        }
    

    Maybe you want to check all substrings of the same length:

    use v5.22;
    use feature qw(signatures);
    no warnings qw(experimental::signatures);
    
    use Text::Levenshtein qw(distance);
    
    say edit( "four", "foar" );
    say edit( "four", "noise fo or blur" );
    
    sub edit ( $start, $target ) {
        my $start_length = length $start;
        $target =~ s/\s+//g;
        my @all_n_chars = map {
            substr $target, $_, 4
            } 0 .. ( length($target) - $start_length );
    
        my $closest;
        my $closest_distance = $start_length + 1;
        foreach ( @all_n_chars ) {
            my $distance = distance( $start, $_ );
            if( $distance < $closest_distance ) {
                $closest = $_;
                $closest_distance = $distance;
                say "closest: $closest Distance: $distance";
                last if $distance == 0;
                }
            }
    
        return $closest_distance;
        }
    

    This very simpleminded implementation finds what you want. However, realize that other random strings might accidentally have an edit distance that is lower.

    closest: foar Distance: 1
    1
    closest: nois Distance: 3
    closest: foor Distance: 1
    1
    

    You could extend this to remember the true starting positions of each string so you can find it again in the original, but this should be enough to send you on your way. If you wanted to use Python, I think the program might look very similar.