Search code examples
perlfastadna-sequenceblast

Perl Program error


I wrote a PERL program which takes an excel sheet (coverted to a text file by changing the extension from .xls to .txt) and a sequence file for its input. The excel sheet contains the start point and the end point of an area in the sequence file (along with 70 flanking values on either side of the match area) that needs to cut and extracted into a third output file. There are like 300 values. The program reads in the start point and the end point of the sequence that needs to be cut each time but it repeatedly tells me that the value is outside the length on the input file when it clearly isn't. I just cant seem to get this fixed

This is the program

use strict;
use warnings;

my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> );    # BLAST result file name
print "\n";

my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> );    # sequence file
print "\n";

open IN, "$blast" or die "Can not open file $blast: $!";

my @ids       = ();
my @seq_start = ();
my @seq_end   = ();

while (<IN>) {

    #spliting the result file based on each tab
    my @feilds = split( "\t", $_ );
    push( @ids, $feilds[0] );    #copying the name of sequence
         #coping the 6th tab value of the result which is the start point of from where a value should be cut.
    push( @seq_start, $feilds[6] );
    #coping the 7th tab value of the result file which is the end point of a value should be cut.
    push( @seq_end, $feilds[7] );
}
close IN;

open OUT, ">Result.fasta" or die "Can not open file $database: $!";

for ( $i = 0; $i <= $#ids; $i++ ) {

    ($sequence) = &block( $ids[$i] );

    ( $idline, $sequence ) = split( "\n", $sequence );

    #extracting the sequence from the start point to the end point
    my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;

    my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen );  #storing the extracted substring into $sequence

    $Nucleotides =~ s/(.{1,60})/$1\n/gs;

    print OUT "$idline\n";
    print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";

sub block {
    #block for id storage which is the first tab in the Blast output file.
    my $id1 = shift;
    print "$id1\n";
    my $start = ();

    open IN3, "$database" or die "Can not open file $database: $!";

    my $blockseq = "";
    while (<IN3>) {

        if ( ( $_ =~ /^>/ ) && ($start) ) {

            last;
        }

        if ( ( $_ !~ /^>/ ) && ($start) ) {

            chomp;
            $blockseq .= $_;
        }

        if (/^>$id1/) {

            my $start = $. - 1;
            my $blockseq .= $_;
        }
    }
    close IN3;

    return ($blockseq);
}

BLAST RESULT FILE: http://www.fileswap.com/dl/Ws7ehftejp/

SEQUENCE FILE: http://www.fileswap.com/dl/lPwuGh2oKM/

Error

substr outside of string at Nucleotide_Extractor.pl line 39.

Use of uninitialized value $Nucleotides in substitution (s///) at Nucleotide_Extractor.pl line 41.

Use of uninitialized value $Nucleotides in concatenation (.) or string at Nucleotide_Extractor.pl line 44.

Any help is very much appreciated and queries are always invited


Solution

  • There were several problems with the existing code, and I ended up rewriting the script while fixing the errors. Your implementation isn't very efficient as it opens, reads, and closes the sequence file for every ID in your Excel sheet. A better approach would be either to read and store the data from the sequence file, or, if memory is limited, go through each entry in the sequence file and pick out the corresponding data from the Excel file. You would also be better off using hashes, instead of arrays; hashes store data in key -- value pairs, so it is MUCH easier to find what you're looking for. I have also used references throughout, as they make it easy to pass data into and out of subroutines.

    If you are not familiar with perl data structures, check out perlfaq4 and perldsc, and perlreftut has information on using references.

    The main problem with your existing code was that the subroutine to get the sequence from the fasta file was not returning anything. It is a good idea to put plenty of debugging statements in your code to ensure that it is doing what you think it is doing. I've left in my debugging statements but commented them out. I've also copiously commented the code that I changed.

    #!/usr/bin/perl
    use strict;
    use warnings;
    # enables 'say', which prints out your text and adds a carriage return
    use feature ':5.10';
    # a very useful module for dumping out data structures
    use Data::Dumper;
    
    #my $blast = 'infesmall.txt';
    print "Enter Your BLAST result file name:\t";
    chomp($blast = <STDIN>);     # BLAST result file name
    print "\n";
    
    #my $database = 'infe.fasta';
    print "Enter Your Gene list file name:\t";
    chomp($database = <STDIN>);  # sequence file
    print "\n";
    
    open IN,"$blast" or die "Can not open file $blast: $!";
    
    # instead of using three arrays, let's use a hash reference!
    # for each ID, we want to store the start and the end point. To do that,
    # we'll use a hash of hashes. The start and end information will be in one
    # hash reference:
    # { start => $fields[6], end => $fields[7] }
    # and we will use that hashref as the value in another hash, where the key is
    # the ID, $fields[0]. This means we can access the start or end data using
    # code like this:
    #   $info->{$id}{start}
    #   $info->{$id}{end}
    my $info;
    
    while(<IN>){
        #splitting the result file based on each tab
        my @fields = split("\t",$_);
        # add the data to our $info hashref with the ID as the key:
        $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
    }
    close IN;
    
    #say "info: " . Dumper($info);
    
    # now read the sequence info from the fasta file
    my $sequence = read_sequences($database);
    #say "data from read_sequences:\n" . Dumper($sequence);
    
    my $out = 'result.fasta';
    open(OUT, ">" . $out) or die "Can not open file $out: $!";
    
    foreach my $id (keys %$info) {
    
        # check whether the sequence exists
        if ($sequence->{$id}) {
            #extracting the sequence from the start point to the end point
            my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;
    
            #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};
    
            #storing the extracted substring into $sequence
            my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
            $nucleotides =~ s/(.{1,60})/$1\n/gs;
            #say "nucleotides: $nucleotides";
            print OUT $sequence->{$id}{header} . "\n";
            print OUT "$nucleotides\n";
        }
    }
    print "\nExtraction Completed...";
    
    sub read_sequences {
        # fasta file
        my $fasta_file = shift;
    
        open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";
    
        # initialise two variables. We will store our sequence data in $fasta
        # and use $id to track the current sequence ID
        # the $fasta hash will look like this:
        # $fasta = {
        #   'gi|7212472|ref|NC_002387.2' => {
        #       header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
        #       seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
        #   }
        #}
    
        my ($fasta, $id);
    
        while(<IN3>){
            chomp;
            if (/^>/) {
                if (/^>(\S+) /){
                    # the header line with the sequence info.
                    $id = $1;
                    # save the data to the $fasta hash, keyed by seq ID
                    # we're going to build up an entry as we go along
                    # set the header to the current line
                    $fasta->{ $id }{ header } = $_;
                }
                else {
                    # no ID found! Erk. Emit an error and undef $id.
                    warn "Formatting error: $_";
                    undef $id;
                }
            }
            ## ensure we're getting sequence lines...
            elsif (/^[ATGC]/) {
                # if $id is not defined, there's something weird going on, so
                # don't save the sequence. In a correctly-formatted file, this
                # should not be an issue.
                if ($id) {
                    # if $id is set, add the line to the sequence.
                    $fasta->{ $id }{ seq } .= $_;
                }
            }
        }
        close IN3;
        return $fasta;
    }