I'm working with DNA sequences in a file, and this file is formatted something like this, though with more than one sequence:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
I need to be able to tell if a variable (which is also a sequence) matches any of the sequences in the file, and what the name of the sequence it matches, if any, is. Because of the nature of these sequences, my entire variable could be contained in a line of the file, or a line of the variable could be a part of my variable. Right now my code looks something like this:
use warnings;
use strict;
my $filename = "/users/me/file/path/file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open file, "<$filename" or die "Can't find file";
my @Name;
my @Sequence;
my $inx = 0;
while (<file>){
$Name[$inx] = <file>;
$Sequence[$inx] = <file>;
$indx++;
}unless(index($Sequence[$inx], $exampleentry) != -1 || index($exampleentry, $Sequence[$inx]) != -1){
$returnval = "The sequence matches: ". $Name[$inx];
}
print $returnval;
However, even when I purposely set $entry as a match from the file, I still return The sequence does not match any in the file
. Also, when running the code, I get Use of uninitialized value in index at thiscode.pl line 14, <file> line 3002.
as well as Use of uninitialized value within @Name in concatenation (.) or string at thiscode.pl line 15, <file> line 3002.
How can I perform this search?
I will assume that the purpose of this script is to determine if $exampleentry
matches any record in the file file.txt
. A record describes here a DNA sequence and corresponds to three consecutive lines in the file. The variable $exampleentry
will match the sequence if it matches the third line of the record. A match means here that either
$exampleentry
is a substring of $line
, or$line
is a substring of $exampleentry
, where $line
referes to the corresponding line in the file.
First, consider the input file file.txt
:
>name of sequence
EXAMPLESEQUENCEATCGATCGATCG
in the program you try to read these two lines, using three calls to readline
. Accordingly, that last call to readline will return undef
since there are no more lines to read.
It therefore seems reasonable that the two last lines in file.txt
are malformed, and the correct format should be:
>name of sequence
EXAMPLESEQUENCE
ATCGATCGATCG
If I now understand you correctly, I hope this could solve your problem:
use feature qw(say);
use strict;
use warnings;
my $filename = "file.txt";
my $exampleentry = "ATCG";
my $returnval = "The sequence does not match any in the file";
open (my $fh, '<', $filename ) or die "Can't find file: $!";
my @name;
my @sequence;
my $inx = 0;
while (<$fh>) {
chomp ($name[$inx] = <$fh>);
chomp ($sequence[$inx] = <$fh>);
if (
index($sequence[$inx], $exampleentry) != -1
|| index($exampleentry, $sequence[$inx]) != -1
) {
$returnval = "The sequence matches: ". $name[$inx];
last;
}
}
say $returnval;
Notes:
I have changed variable names to follow snake_case convention. For example the variable @Name
is better written using all lower case as @name
.
I changed the open()
call to follow the new recommended 3-parameter style, see Don't Open Files in the old way for more information.
Added a chomp after each readline to avoid storing newline characters in the arrays.