Search code examples
arraysperlloopsbioinformaticsedit-distance

Percentage edit distance from array


I am attempting to get a percentage of an edit distance from a group of sequences. So far this is what I have:

#!/usr/bin/perl -w
use strict;
use Text::Levenshtein qw(distance);

my @sequence = qw(CA--------W----------------------EKDRRTEAF---F------ 
CA--------W----------------------EKDRRTEAF---F------ 
CA--------S-------------------SLVFGQGDNIQY---F------  
RA--------S-------------------SLIYSP----LH---F------);


foreach my $list (@sequence){
    my @distance = distance($list, @sequence);
    my @length = $list =~ tr/[A-Z]///;
}

I am able to get the edit distance with @distance and the length of each sequence, based on the letters with @length. If printed the results are as follows:

@distance

0 0 13 14
0 0 13 14
13 13 0 11 
14 14 11 0

@length

13
13
16
12

As each line of @length is equivalent to each line of @sequence, when comparing the two lines I would like to use the largest @length to get the percentage. So as when having an edit distance between the second and third sequence it would use the length of 16 rather than 13 to get a percentage. What I think needs to happen is to call only two elements of the @length array and pick the larger one to then put into a percentage, possibly using an if statement.

I know this code is wrong, but it is generally the idea I am going for:

foreach my  $list (@sequence){
        my @distance = distance($list, @sequence);      
        my @length = $list =~ tr/[A-Z]//;                # / syntax hilite fix

        foreach my $item(@distance){
                foreach @length {
                        my $num1 = if $length[0] >= $length[1];
                                 print "$item/$num1\n";
                        else my $num2 = $length[1] >= $length[0];
                                print "$item/$num2\n";
                }
        }
}

The answer should look something similar to that below:

0  0 .8125  1.0769
0  0  .8125  1.0769
.8125  .8125  0  .6875
1.0769  1.0769  .6875  0

Solution

  • Try this. To summarize: We compute the edit distances between pairs of strings. For each pair we want to determine the fraction of the distance and the maximum number of characters (A-Z). The maximum number of characters is taken to be the maximum for the two items in the pair.

    use strict;
    use warnings;
    
    use Text::Levenshtein qw(distance);
    
    my @sequence = qw(
            CA--------W----------------------EKDRRTEAF---F------
            CA--------W----------------------EKDRRTEAF---F------
            CA--------S-------------------SLVFGQGDNIQY---F------
            RA--------S-------------------SLIYSP----LH---F------
    );
    
    my @length = map { tr/[A-Z]// } @sequence;
    
    for my $i (0..$#sequence) {
        my $list = $sequence[$i];
        my @distance = distance($list, @sequence);
        my $num1 = $length[$i];
        for my $j (0..$#distance) {
            my $item = $distance[$j];
            my $num2 = $length[$j];
            my $num = ( $num2 > $num1 ) ? $num2 : $num1;
            printf "%.4f ", $item/$num;
        }
        print "\n";
    }
    

    Output:

    0.0000 0.0000 0.8125 1.0769 
    0.0000 0.0000 0.8125 1.0769 
    0.8125 0.8125 0.0000 0.6875 
    1.0769 1.0769 0.6875 0.0000