Search code examples
regexperlprotein-database

Generating a random subset sequences from a fasta file


Hello to Perl Masters in the world.

I have another trouble for programming. I am coding a program which selects random sequences from a proteom fasta file with a certain input number.

A general fasta file looks like this:

>seq_ID_1 descriptions etc ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS SFASGDASGDSSDFDSFSDFSD

>seq_ID_2 descriptions etc ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH ASDSADGDASHDASHSAREWAWGDASHASGASGASG

and so on.......

The letters represent amino acid peptides.

So I have a fasta file with 1000 sequences and want to retrieve 63.21% of them, which will be 632.1 sequenes. But sequence cannot be floting number so if it exceeds 0.5 I want to round up and if less than 0.5 round down.

This is my code for generating random sequence subset, but its slightly not so good at working.

#!/usr/bin/perl

#Selecting 63.21% of random sequnces from a proteom file.
use strict;
use warnings;
use List::Util qw(shuffle);

#Give the first argument as a proteom file.
if (@ARGV != 1)
{
    print "Invalid arguments\n";
    print "Usage: perl randseq.pl [proteom_file]";
    exit(0);
}

my $FILE = $ARGV[0];
my $i = 0;
my %protseq = {};
my $nIdx = 0;

#Extraction and counting of the all headers from a proteom file.
open(LIST,$FILE);
open(TEMP1, ">temp1");
while (my $line = <LIST>){
    chomp $line;
    if ($line =~ />(\S+) (.+)/){
        $i++;
        print TEMP1 $1,"\n";
    }
}
close(LIST);
close(TEMP1);

#Selection of random headers for generating a random subset of the proteom file.
my $GET_LINES = RoundToInt ($i*0.6321);

my @line_starts;
open(my $FH,'<','temp1');
open(TEMP2, ">temp2");

do {
     push @line_starts, tell $FH
} while ( <$FH> );

my $count = @line_starts;

my @shuffled_starts = (shuffle @line_starts)[1..$GET_LINES+1];

for my $start ( @shuffled_starts ) {

     seek $FH, $start, 0
         or die "Unable to seek to line - $!\n";

     print TEMP2 scalar <$FH>;
}
close(TEMP2);

#Assigning the sequence data to randomly generated header file.
open(DATA,'<','temp2');
while(my $line = <DATA>)
{
    chomp($line);
    $line =~ s/[\t\s]//g;
    if($line =~ /^([^\s]+)/)
    {
        $protseq{$1}++;
    }
}
close(DATA);

open(DATA, "$FILE");
open(OUT, ">random_seqs.fasta");
while(my $line = <DATA>)
{
    chomp($line);
    if($line =~ /^>([^\s]+)/)
    {
        if($protseq{$1} ne "")
        {

            $nIdx = 1;
            print OUT "$line\n";
        }
        else
        {
            $nIdx = 0;
        }
    }
    else
    {
        if($nIdx == 1)
        {
            print OUT "$line\n";
        }
    }
}
close(DATA);
close(OUT);

#subroutine for rounding
sub RoundToInt {
  int($_[0] + .5 * ($_[0] <=> 0));
}

system("erase temp1");
system("erase temp2");
exit;

However, it sometimes give the proper number of sequences and sometimes with one more sequence. How can I get rid off that... any ideas please?

or Perhaps better shorter code?

here you can obtains a 75 yeast proteom file. [http://www.peroxisomedb.org/Download/Saccharomyces_cerevisiae.fas][1]

Hope I can fix this soon... :(


Solution

  • Your approach looks fine, just needlessly complicated. I would do it like this:

    use strict;
    use warnings;
    
    # usage: randseq.pl [fraction] < input.fasta > output.fasta
    my $fraction = (@ARGV ? shift : 0.6321);
    
    # Collect input lines into an array of sequences:
    my @sequences;
    while (<>) {
        # A leading > starts a new sequence. (The "\" is only there to
        # avoid confusing the Stack Overflow syntax highlighting.)
        push @sequences, [] if /^\>/;
        push @{ $sequences[-1] }, $_;
    }
    
    # Calculate how many sequences we want:
    my $n = @sequences;
    my $k = int( $n * $fraction + 0.5 );
    warn "Selecting $k out of $n sequences (", 100 * $k / $n, "%).\n";
    
    # Do a partial Fisher-Yates shuffle to select $k random sequences out of $n:
    foreach my $i (0 .. $k-1) {
        my $j = $i + int rand($n-$i);
        @sequences[$i,$j] = @sequences[$j,$i];
    }
    
    # Print the output:
    print @$_ for @sequences[0 .. $k-1];
    

    Note that this code reads the entire contents of the input file into memory. If the input file is too large for that, and you only want a small subset of it, it's possible to use reservoir sampling to select k random sequences out of an arbitrarily large collection of them without holding more:

    use strict;
    use warnings;
    
    my $k = (@ARGV ? shift : 632);  # sample size: need to know this in advance
    
    # Use reservoir sampling to select $k random sequences:
    my @samples;
    my $n = 0;  # total number of sequences read
    my $i;      # index of current sequence
    while (<>) {
        if (/^\>/) {
            # Select a random sequence from 0 to $n-1 to replace:
            $i = int rand ++$n;
            # Save all samples until we've accumulated $k of them:
            $samples[$n-1] = $samples[$i] if $n <= $k;
            # Only actually store the new sequence if it's one of the $k first ones:
            $samples[$i] = [] if $i < $k;
        }
        push @{ $samples[$i] }, $_ if $i < $k;
    }
    
    warn "Only read $n < $k sequences, selected all.\n" if $n < $k;
    warn "Selected $k out of $n sequences (", 100 * $k / $n, "%).\n" if $n >= $k;
    
    # Print sampled sequences:
    print @$_ for @samples;
    

    However, if you really want a certain fraction of the input sequences, you'll need to first count them in a separate pass over the file.

    Both of the programs above also uniformly shuffle the sampled sequences as a side effect. (In fact, I deliberately tweaked the reservoir sampling algorithm to make the shuffle uniform for all values of n and k.) If you don't want that, you can always sort the sequences according to whatever criterion you prefer before printing them.