Search code examples
perldesign-patternsfasta

Find pattern mulifasta list based on user input in Perl


I am trying to fix a Perl code. Given the following "file.txt":

>otu1  
AACGCCTTTCCNGGATGGCAAAATTTNTNGTAAA
AGGGCACCCANTTCTGGCTCGAAA  
>otu2
NNAATCGGNNNGGGGCGTAANGAGGTTNCGGCACGG
TNCCCGTTTANCG
>otu3   
CTGGNATAAAAAANNNNTACTTAA

After providing a otu number as argument (i.e. otu2) when calling the program, I want to first (1) check if that otu is present in the file.txt, then (2) find the pattern [NC].[CT] (element N or C, followed by any element . and followed by an element C or T) within the otu sequence, and finally (3) print out the start‐ and end‐position of every site.

For the first two questions I am trying with the following code but I am encountering problems by integrating subroutines.


#!/usr/bin/perl -w

use warnings;
use strict;

$otu = $ARGV[0];   
check_otu("file.txt");

sub check_otu {
    my $content = shift;
    open(my $fh, '<' , $content) || die "File not found: $!";
    my $content;    

    while( my $line = <$fh> ) {
        if ( $line =~ /^>/ ) {
            check_pattern($content) if $content=$otu;
            $content = $line;
        }
        else {
            $content .= $line;
        }
    }
    check_motifs($content);
}

}
sub check_pattern{
    my $fasta = $content;
    $count++ if count_pattern($fasta);
}
sub count_pattern {
    my $chain = $content;
    my @all = $chain =~ /([NC].[CT])/g;
    scalar @all;
}

I got these errors:

"my" variable $content masks earlier declaration in same scope at proof.pl line 12.
Name "main::count" used only once: possible typo at proof.pl line 28.
Undefined subroutine &main::check_motifs called at proof.pl line 23, <$fh> line 8.

Would you have any suggestion? Any hint for the third question? Thanks for your help


Solution

  • bioperl makes it easy to read fasta files. Use it instead of trying to re-invent the wheel.

    The special variables @- and @+ hold the indexes of the start and end of the last matched pattern (And any capturing groups inside it). You'll need that for your third bit.

    You might end up with something like:

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use feature qw/say/;
    use Bio::SeqIO;
    
    my ($file, $otu) = @ARGV;
    my $fasta = Bio::SeqIO->new(-file => $file, -format => 'fasta');
    my $found = 0;
    
    while (my $seq = $fasta->next_seq()) {
      next unless $seq->primary_id() eq $otu;
      $found = 1;
      my $s = $seq->seq();
      while ($s =~ m/[NC].[CT]/g) {
        my $start = $-[0];
        my $stop = $+[0] - 1; # Index in this array is 1 past the last character
        say "$start $stop";
      }
    }
    
    say "$otu not found" unless $found;
    

    Example:

    $ perl otu.pl sample.fasta otu2
    15 17
    31 33
    37 39
    40 42