I'm having some trouble manipulating an array of DNA sequence data that is in .fasta format. What I would specifically like to do is take a file that has a few thousand sequences and adjoin sequence data for each sequence in the file onto a single line in the file. [Fasta format is as such: A sequence ID starts with > after which everything on that line is a description. On the next line(s) the sequence corresponding to this ID is present. And this can continue indefinitely until the next line that begins with >, which is the id of the next sequence in the file] So, in my particular file most of my sequences are on multiple lines, so what I would like to do is essentially remove the newlines, but only the new lines between sequence data, not between sequence data and sequence ID lines (that start with >).
I'm doing this because I want to be able to attain sequence lengths of each sequence (through length, I believe is the easiest way), and then get an average sequence length of all the sequences in the whole file.
Here's my script so far, that doesnt seem to want to work:
#!/usr/bin/perl -w
##Subroutine
sub get_file_data1 {
my($filename) = $_[0];
my @filedata = ();
unless( open(GET_FILE_DATA, $filename)) {
print STDERR "Cannot open file \"$filename\"\n\n";
exit;
}
@filedata = <GET_FILE_DATA>;
close GET_FILE_DATA;
return @filedata;
}
##Opening files
my $fsafile = $ARGV[0];
my @filedata = &get_file_data1($fsafile);
##Procedure
my @count;
my @ids;
my $seq;
foreach $seq (@filedata){
if ($seq =~ /^>/) {push @ids, $seq;
push @count, "\n";
}
else {push @count, $seq;
}
}
foreach my $line (@count) {
if ($line =~ /^[AGTCagtc]/){
$line =~ s/^([AGTCagtc]*)\n/$1/;
}
}
##Make a text file to have a look
open FILE3, "> unbrokenseq.txt" or die "Cannot open output.txt: $!";
foreach (@count)
{
print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
__END__
##Creating array of lengths
my $number;
my @numberarray;
foreach $number (@count) {
push @numberarray, length($number);
}
print @numberarray;
__END__
use List::Util qw(sum);
sub mean {
return sum(@numberarray)/@numberarray;
}
There's something wrong with the second foreach line of the Procedure section and I can't seem to figure out what it is. Note that the code after the END lines I haven't even tried yet because I cant seem to get the code in the procedure step to do what I want. Any idea how I can get a nice array with elements of unbroken sequence (I've chosen to just remove the sequence ID lines from the new array..)? When I can then get an array of lengths, after which I can then average?
Finally I should unfortunately admit that I cannot get Bio::Perl working on my computer, I have tried for hours but the errors are beyond my skill to fix. Ill be talking to someone who can hopefully help me with my Bio::perl issues. But for now I'm just going to have to press on without it.
Thanks! Sorry for the length of this post, I appreciate the help.
Andrew
The problem with your second loop is that you are not actually changing anything in @count
because $line
contains a copy of the values in @count
.
But, if all you want to do in the second loop is to remove the newline character at the end, use the chomp
function. with this you wouldn't need your second loop. (And it would also be faster than using the regex.)
# remove newlines for all array elements before doing anything else with it
chomp @filedata;
# .. or you can do it in your first loop
foreach $seq (@filedata){
chomp $seq;
if ($seq =~ /^>/) {
...
}
An additional tip: Using get_file_data1
to read the entire file into an array might be slow if your files are large. In that case it would be better to iterate through the file as you go:
open my $FILE_DATA, $filename or die "Cannot open file \"$filename\"\n";
while (my $line = <$FILE_DATA>) {
chomp $line;
# process the record as in your Procedure section
...
}
close $FILE_DATA;