
Perl: Return Highest Percent Match for Strings


I have a DNA sequence, for example ATCGATCG. I also have a database of DNA sequences formatted as follows:

>Name of sequence1
SEQUENCEONEEXAMPLEGATCGATC
>Name of sequence2
SEQUENCETWOEXAMPLEGATCGATC

(So the odd-numbered lines contain a name and the even-numbered lines contain a sequence.) Currently, I search for perfect matches between my sequence and the sequences in the database as follows (assume all the variables are declared):

my $name;
my $seq;
my $returnval = "The sequence does not match any in database";
open (my $database, "<", $db1) or die "Can't open $db1: $!";
until (eof $database){
    chomp ($name = <$database>);
    chomp ($seq = <$database>);
    if (
        index($seq, $entry) != -1
        || index($entry, $seq) != -1
    ) {
        $returnval = "The sequence matches: ". $name;
        last;
    }
}
close $database;

Is there any way to return the name of the sequence with the highest percentage match, along with the percent match between the entry and that database sequence?


Solution

  • String::Similarity returns the similarity between strings as a value between 0 and 1, 0 being completely dissimilar and 1 being exactly the same.

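    use String::Similarity;    # CPAN module; provides similarity()

    my $entry = "AGGUUG";
    my $returnval = "The sequence does not match any in database";
    my $highestsim = 0;
    my $highestname;
    # $db1 holds the path to the database file, as in the question
    open (my $database, "<", $db1) or die "Can't open $db1: $!";
    until (eof $database){
        chomp (my $name = <$database>);
        chomp (my $seq = <$database>);
        # The third argument is a lower limit: a score below it need not be exact,
        # so passing the best score so far lets poor candidates be rejected early.
        my $currsim = similarity($entry, $seq, $highestsim);
        if ($currsim > $highestsim) {
            $highestsim = $currsim;
            $highestname = $name;
        }
    }
    close $database;
    # Only build the result message if some sequence scored above 0.
    if (defined $highestname) {
        (my $label = $highestname) =~ s/^>//;    # drop the leading ">" from the name line
        $returnval = sprintf "This sequence matches %s the best with %.1f%% similarity", $label, $highestsim * 100;
    }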
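
If installing String::Similarity is not an option, the same "keep the best score" loop can be sketched with core Perl only. The version below is a minimal sketch that swaps in Levenshtein edit distance as the scoring function. That is an assumption made for illustration: it is not claimed to be the measure String::Similarity uses, and its scores will generally differ. The sub names edit_distance, percent_match, and best_match are made up for this example, and the records are passed in as a list of lines instead of being read from the database file.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein edit distance between two strings (two-row dynamic programming).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @curr = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Percent match, defined here as 100 * (1 - distance / length of the longer string).
sub percent_match {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 100 if $longest == 0;    # two empty strings count as identical
    return 100 * (1 - edit_distance($s, $t) / $longest);
}

# Walk name/sequence line pairs and return (name, percent) for the best score.
sub best_match {
    my ($entry, @lines) = @_;
    my ($best_name, $best_pct);
    while (@lines >= 2) {
        my ($name, $seq) = splice @lines, 0, 2;
        my $pct = percent_match($entry, $seq);
        if (!defined $best_pct || $pct > $best_pct) {
            ($best_name, $best_pct) = ($name, $pct);
        }
    }
    return ($best_name, $best_pct);
}

my @db = (">seq1", "AAAAAAAA", ">seq2", "ATCGATCG");
my ($name, $pct) = best_match("ATCGTTCG", @db);
$name =~ s/^>//;                       # drop the leading ">" from the name line
printf "%s: %.1f%%\n", $name, $pct;    # prints "seq2: 87.5%"
```

Edit distance takes time proportional to the product of the two string lengths, so this is only practical for short sequences or small databases.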