Search code examples
regexperlfasta

Counting sequences in a file beginning and ending with a user defined match


I have a fasta formatted file of DNA sequences called "test.fas":

>test1
GCCATTACAGAACATCAGTCACAGTACGTACTGTGTTCTGCCGTGCTGTCTA
>test2
CGGATGAAGCGCCAATCGTACGTACAATAAGTTGCCTAAAGTGTTTCA
>test3
ATGCATGCATGC

I also have a file of tab delimited primer sequences called "primers.txt":

GCCATTACAGAACATCAGTCACA TAGACAGCACGGCAGAACAC
CGGATGAAGCGCCAATC   TGAAACACTTTAGGCAACTTATT

Each line in this primers.txt file is a primer pair that may match the start and end of a sequence in the fasta file. The second primer on each line also needs to be reverse complemented before it would match anything in the fasta file. Looking at the first primer pair on the first line in primers.txt, after reverse complementing the second primer, it should match the sequence for test1 in the test.fas file.

What I want to be able to do is supply these two files to a perl program, and get an output file of counts for how many times a sequence was found with a primer pair from the primers.txt file. In this case, my outfile would list:

1
1

In reality I have 650000 sequences in a file, and 170 primer sets to search for and enumerate from the file. Therefore, I want an outfile 170 lines long, with each line listing the number of times a match was found in the fasta file for that specific primer pair. Basically, for each line in the primer.txt file, count the number of times a sequence shows up in the fasta file that begins and ends with that primer pair. This is what I have come up with so far:

#!/usr/bin/perl
use strict;
use warnings;

print "Name of the FASTA file: ";
chomp( my $multifasta = <STDIN> );

print "Name file with primers: ";
chomp( my $pulls = <STDIN> );

print "Name of the output file: ";
chomp( my $out = <STDIN> );

open(MULTIFASTA,$multifasta) || die ;
  my $seq = do { local $/; <MULTIFASTA>};
  close MULTIFASTA;

open(PULLS,$pulls) || die;
  while (my $line = <PULLS>){
  chomp $line;
  my @primers = split (/\t/,$line);
  my $revcomp = reverse $primers[1];
  $revcomp =~ tr/ATGCatgc/TACGtacg/;  #reverse complement the reverse primer
  my $matches = () = $seq =~ /^\Q$primers[0].*\Q$primers[1]$/; #How to structure the regex? 
  open(OUTFILE,">>$out");
  print OUTFILE "$matches\n";   
}

My outfile ends up with this:

0
0

I obviously have something screwed up. I am rapidly falling in the trap of trying different things I find on Google without having a firm grasp of what that is doing to the code, and at this point I am lost. That is a consequence of really needing the answer soon, and knowing very little about programming. I gathered from reading that I should be reading in the whole file to scan for matches in with local, and I need to use \Q to search for a variable in a regex in perl. Anyway, any help or pointers would be very much appreciated. Thanks -


Solution

  • Create a regex from all the primers. Also, store the primers in a hash, the values will be the line numbers. Then, iterate over the fasta file and try to match the regex. If it matches, use the hash to retrieve the line number of the primer and use another hash to record the number of matches per line number. At the end, just report the numbers:

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    my ($fasta_file, $primers_file) = @ARGV;
    
    my %primer;
    open my $primers_fh, '<', $primers_file or die $!;
    while (<$primers_fh>) {
        chomp;
        my ($first, $second) = split /\t/;
        $second = reverse $second;
        $second =~ tr/actgACTG/tgacTGAC/;
        undef $primer{$first}{$.};
        undef $primer{$second}{$.};
    }
    
    my $primers_count = $.;
    my $regex =  join '|', keys %primer;
    
    my %seen;
    open my $fasta_fh, '<', $fasta_file or die $!;
    while (<$fasta_fh>) {
        if (/^($regex)/) {
            ++$seen{$_} for keys %{ $primer{$1} };
        }
    }
    
    for my $line_number (sort { $a <=> $b } 1 .. $primers_count) {
        print $seen{$line_number} // 0, "\n";
    }