perl hashmap bioinformatics perl-data-structures bioperl

Editing help with perl script to start and stop at specific places within an array

Looking for troubleshooting and editing help. This is a homework assignment, and my professor encourages the use of forums. I don't have experience with Perl functions or subs yet, so please keep responses at a level I can follow.

The purpose of the script is to read a string of DNA (or a file from the command line, which I will add later), transcribe it into RNA, and then print the resulting protein as uppercase one-letter amino acid names.

The function of the script:

Take 3-character "codons", starting from the first character, and map each one to a single-letter symbol (an uppercase one-letter amino acid name from the hash table).

Print the proteins, which are strings that start with AUG ("M") and end with UAG, UAA or UGA.

If a gap is encountered, a new line is started and the process is repeated. We can assume that gap lengths are multiples of three.

Main problems as far as I can tell:

I don't know where to have the data loop through the hash table. I've tried placing it before and after my foreach block. I've also taken the foreach block out altogether and tried while and if.

The foreach block doesn't seem to process all of the @all_codons array; it only seems to stop at AUG.

The obvious and biggest problem is that it returns nothing. Somewhere along the way the $next_codon value is being treated as false. I've tried commenting out each line piece by piece: the last line that returned anything was my $start, and from there on it's all false.

The Script:

$^W = 1;
use strict;


my $dna_string = "CCCCAAATGCTGGGATTACAGGCGTGAGCCACCACGCCCGGCCACTTGGCATGAATTTAATTCCCGCCATAAACCTGTGAGATAGGTAATTCTGTTATATCCACTTTACAAATGAAGAGACTGAGGCAAAGAAAGATGATGTAACTTACGCAAAGC";

my %codon_codes = (
    "UUU" => "f", "UUC" => "f", "UUA" => "l", "UUG" => "l",
    "CUU" => "l", "CUC" => "l", "CUA" => "l", "CUG" => "l",
    "AUU" => "i", "AUC" => "i", "AUA" => "i", "AUG" => "m",
    "GUU" => "v", "GUC" => "v", "GUA" => "v", "GUG" => "v",
    "UCU" => "s", "UCC" => "s", "UCA" => "s", "UCG" => "s",
    "CCU" => "p", "CCC" => "p", "CCA" => "p", "CCG" => "p",
    "ACU" => "t", "ACC" => "t", "ACA" => "t", "ACG" => "t", 
    "GCU" => "a", "GCC" => "a", "GCA" => "a", "GCG" => "a",
    "UAU" => "y", "UAC" => "y", "UAA" => " ", "UAG" => " ",
    "CAU" => "h", "CAC" => "h", "CAA" => "q", "CAG" => "q",
    "AAU" => "n", "AAC" => "n", "AAA" => "k", "AAG" => "k"
 );

my $rna_string = $dna_string;
$rna_string =~ tr/[tT]/U/;

my @all_codons = ($rna_string =~ m/.../g);

foreach my $next_codon(@all_codons){
            
    while ($next_codon =~ /AUG/gi){
            
        my $start = pos ($next_codon) -3;
    
        last unless $next_codon =~ /U(AA|GA|AG)/gi;
    
        my $stop = pos($next_codon);
            
        my $genelen = $stop - $start;
            
        my $gene = substr ($next_codon, $start, $genelen);
            
        print "\n" . join($start+1, $stop, $gene,) . "\n";
    }
}

Solution

I don't understand the 'data loop through the hash table' part.
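
If the question is where to loop over %codon_codes, a loop may not be needed at all: a hash is looked up directly by key. A minimal sketch, assuming a three-entry excerpt of the table with uppercase letters and a hypothetical "---" gap marker (the question does not say how gaps are written):

```perl
use strict;
use warnings;

# Excerpt of the codon table only; the real table has more entries.
my %codon_codes = ("AUG" => "M", "CUG" => "L", "CAG" => "Q");

# "---" is an assumed gap marker, used here just to show a failed lookup.
my @all_codons = ("AUG", "CUG", "---", "CAG");
my $protein = "";

foreach my $codon (@all_codons) {
    if (exists $codon_codes{$codon}) {
        $protein .= $codon_codes{$codon};   # direct lookup by key, no inner loop
    }
    else {
        $protein .= "?";                    # codon is not in the table
    }
}

print "$protein\n";   # prints "ML?Q"
```

The foreach walks the codons; the hash itself is never iterated.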

It seems to me that, for each codon, you need to check whether it is a start codon, a stop codon, a gap or an amino acid. You also need some way to keep state between codons (below as $in_gene).
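
One likely reason the original while loop prints nothing: each element of @all_codons is only three characters long, so after /AUG/g matches, pos sits at the end of the string and the following /g search for a stop codon has nothing left to scan. A small sketch of that behaviour on a single codon (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $codon = "AUG";                 # stands in for one element of @all_codons

my $start_found = ($codon =~ /AUG/g) ? 1 : 0;
my $pos_after   = pos($codon);     # 3, i.e. the end of the string

# The next /g match resumes at pos 3, where no characters remain.
my $stop_found  = ($codon =~ /U(?:AA|GA|AG)/g) ? 1 : 0;

print "start found: $start_found, pos: $pos_after, stop found: $stop_found\n";
# prints "start found: 1, pos: 3, stop found: 0"
```

Because the stop search always fails, the `last unless ...` line exits before anything is printed. Tracking state across iterations, as in the loop that follows, avoids this.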

my $in_gene = 0;    # 1 while we are between a start and a stop codon

foreach my $next_codon (@all_codons) {
    if ($next_codon eq 'AUG') {                    # start codon
        $in_gene = 1;
    }
    elsif ($next_codon =~ m/^U(?:AA|GA|AG)$/) {    # stop codon
        $in_gene = 0;
    }
    elsif ($in_gene) {
        my $aminoacid = $codon_codes{$next_codon};
        if (defined $aminoacid) {
            print $aminoacid;
        }
        else {
            print "\n";    # gap, or a codon missing from %codon_codes
        }
    }
}

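With the %codon_codes table from the question, this prints the output below. Codons that are missing from that table (GGA, AGA and AGG here) are treated like gaps and start a new line, and the start codon itself is not printed: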

l
lqak
l
q
k
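
The output above is lowercase and fragmented mainly because the question's table is incomplete. A fuller sketch follows, under several assumptions that are mine rather than the assignment's: the standard genetic code (built here from a 64-letter string in U, C, A, G order, with "*" as a stop marker), "M" printed for the start codon, a newline after each stop, unknown codons treated as gaps, and a protein without a stop codon still printed. It also uses tr/tT/U/, since tr takes plain character lists and the square brackets in tr/[tT]/U/ are treated as literal characters.

```perl
use strict;
use warnings;

# Build the standard genetic code; "*" marks a stop codon.
my @bases  = ("U", "C", "A", "G");
my $aminos = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
my %codon_codes;
my $i = 0;
foreach my $b1 (@bases) {
    foreach my $b2 (@bases) {
        foreach my $b3 (@bases) {
            $codon_codes{"$b1$b2$b3"} = substr($aminos, $i, 1);
            $i++;
        }
    }
}

my $dna_string = "CCCCAAATGCTGGGATTACAGGCGTGAGCCACCACGCCCGGCCACTTGGCATGAATTTAATTCCCGCCATAAACCTGTGAGATAGGTAATTCTGTTATATCCACTTTACAAATGAAGAGACTGAGGCAAAGAAAGATGATGTAACTTACGCAAAGC";

my $rna_string = $dna_string;
$rna_string =~ tr/tT/U/;                     # transcribe: T becomes U
my @all_codons = ($rna_string =~ m/.../g);   # split into 3-character codons

my $in_gene = 0;     # 1 while between a start and a stop codon
my $output  = "";

foreach my $codon (@all_codons) {
    my $amino = $codon_codes{$codon};        # undef for gaps or unknown codons
    if (!defined $amino || $amino eq "*") {  # gap or stop: end the current protein
        $output .= "\n" if $in_gene;
        $in_gene = 0;
    }
    elsif ($in_gene) {                       # inside a protein, AUG included
        $output .= $amino;
    }
    elsif ($codon eq "AUG") {                # start codon opens a new protein
        $in_gene = 1;
        $output .= "M";
    }
}
$output .= "\n" if $in_gene;                 # assumed: print an unterminated protein
print $output;
```

For the question's DNA string this should print MLGLQA and MKRLRQRKMM on separate lines. Whether an unterminated protein or an unknown codon should be handled this way depends on the assignment, so treat those branches as placeholders.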