I have a for loop with nested if, else, and elsif statements within. The for loop runs correctly, but it only runs once for some reason. I am looking to count the A's, C's, G's, and T's in a sequence, but I want to count them in two groups - a motif group and a background group. The motif group counts need to be position specific while the background counts do not.
Here is what is contained in my .dna file (a .txt would work fine): AGGCT
Here is what I have so far:
use strict;
use warnings;
#Upload sequence
print "Please enter the filename of the first sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and ignore comment lines
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
for (<FILE>)
{
next if /^#/;
next if /^>/;
$dna .= $_;
}
close FILE;
#Remove white spaces
$dna =~ s/[\s\d]//g;
$dna =~ /./g;
#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;
#Remove newline from file
chomp $width;
#Omitting code for non-negative widths to keep this shorter
#Initialize counts and arrays for motif positions
my @motA;
my @motC;
my @motG;
my @motT;
#Define length of motif arrays per width
for(0..($width-1))
{
$motA[$_] = 0;
$motC[$_] = 0;
$motG[$_] = 0;
$motT[$_] = 0;
}
#Initialize background counts
my $bgA = 0;
my $bgC = 0;
my $bgG = 0;
my $bgT = 0;
#Generate random start site in the sequence
#for motif to start from
my $ms = int(rand(((length($dna)+1)-$width)));
#Within a motif, count the bases at the positions
for (my $pos = 0..(length($dna)-1))
{
my $base = substr($dna, $pos, 1);
if ($pos = $ms..($ms + $width))
{
#Add to motif counts
if($base eq 'A')
{
$motA[$pos-$ms] = $motA[$pos-$ms] + 1;
}
elsif($base eq 'C')
{
$motC[$pos-$ms] = $motC[$pos-$ms] + 1;
}
elsif($base eq 'G')
{
$motG[$pos-$ms] = $motG[$pos-$ms] + 1;
}
elsif($base eq 'T')
{
$motT[$pos-$ms] = $motT[$pos-$ms] + 1;
}
}
else
{
#Create background counts
if ($base eq 'A')
{
$bgA = $bgA + 1;
}
elsif ($base eq 'C')
{
$bgC = $bgC + 1;
}
elsif ($base eq 'G')
{
$bgG = $bgG + 1;
}
elsif ($base eq 'T')
{
$bgT = $bgT + 1;
}
}
}
print "A @motA\nC @motC\nG @motG\nT @motT\n\n";
print "bgA = $bgA\n
bgC = $bgC\n
bgG = $bgG\n
bgT = $bgT";
The output looks like this:
Please enter the filename of the first sequence data: sample.dna
Please enter the motif width:
3
Argument "" isn't numeric in substr at line 62, <STDIN> line2.
A 0 1 0
C 0 0 0
G 0 0 0
T 0 0 0
bgA = 0
bgC = 0
bgG = 0
bgT = 0
I know that this is most likely because my $dna or $pos in the line with substr contains the "" (empty string?), but I am unsure how to solve this problem. I thought the initialization of $pos took care of that, but that's why I want to ask the masters to see what to do. I THINK that this will solve the for loop problem as well. As always, any and all help is useful. I appreciate it in advance!
This:
for (my $pos = 0..length($dna))
{
my $base = substr($dna, $pos, 1);
probably is meant to be 0..length($dna)-1
instead?
When $pos is the length, the substring is going to be an empty string.
And that's not the proper syntax for a for loop iterating over a range. It should be
for my $pos (0..length($dna)-1)
This:
if ($pos = $ms..($ms + $width))
if I understand correctly should be
if ($pos >= $ms && $pos < $ms + $width)
What you have is assigning to $pos the result of a flipflop operation, which is not going to be anything useful.
It looks like this:
my $ms = int(rand(((length($dna)+1)-$width)));
should be
my $ms = int(rand(((length($dna))-$width)));
E.g. if the $dna length is 10 and width is 3, you want the possible starting offsets to be 0 through 7, not 0 through 8.
And it looks like your counting within the motif should be using the position within the motif, not $pos; this:
$motA[$pos] = $motA[$pos] + 1;
should be
$motA[$pos-$ms] = $motA[$pos-$ms] + 1;