Search code examples
stringperldna-sequence

Perl: String in Substring or Substring in String


I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:

>name of sequence
EXAMPLESEQUENCEATCGATCGATCG

I need to be able to tell if a variable (which is also a sequence) matches any of the sequences in the file, and what the name of the sequence it matches, if any, is. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the variable could be a part of my variable. Right now my code looks something like this:

use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my @Name;
my @Sequence;
my $inx = 0;
while (<file>){
    $Name[$inx] = <file>;
    $Sequence[$inx] = <file>;
    $indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
    $returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;

However, even when I purposely set $entry as a match from the file, I still return The sequence does not match any in the file. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002. as well as Use of uninitialized value within @Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.

How can I perform this search?


Solution

  • I will assume that the purpose of this script is to determine if $exampleentry matches any record in the file file.txt. A record describes here a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry will match the sequence if it matches the third line of the record. A match means here that either

    • $exampleentry is a substring of $line, or
    • $line is a substring of $exampleentry,

    where $line referes to the corresponding line in the file.

    First, consider the input file file.txt:

    >name of sequence
    EXAMPLESEQUENCEATCGATCGATCG
    

    in the program you try to read these two lines, using three calls to readline. Accordingly, that last call to readline will return undef since there are no more lines to read.

    It therefore seems reasonable that the two last lines in file.txt are malformed, and the correct format should be:

    >name of sequence
    EXAMPLESEQUENCE
    ATCGATCGATCG
    

    If I now understand you correctly, I hope this could solve your problem:

    use feature qw(say);
    use strict;
    use warnings;
    
    my $filename = "file.txt";
    my $exampleentry = "ATCG";
    my $returnval = "The sequence does not match any in the file";
    open (my $fh, '<', $filename ) or die "Can't find file: $!";
    my @name;
    my @sequence;
    my $inx = 0;
    while (<$fh>) {
        chomp ($name[$inx] = <$fh>);
        chomp ($sequence[$inx] = <$fh>);
        if (
            index($sequence[$inx], $exampleentry) != -1
            || index($exampleentry, $sequence[$inx]) != -1
        ) {
            $returnval = "The sequence matches: ". $name[$inx];
            last;
        }
    }
    say $returnval;
    

    Notes:

    • I have changed variable names to follow snake_case convention. For example the variable @Name is better written using all lower case as @name.

    • I changed the open() call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.

    • Used feature say instead of print

    • Added a chomp after each readline to avoid storing newline characters in the arrays.