Search code examples
perlsequencesoverlapping-matches

How do I counting overlapping dimers for multiple sequences?


I have to count the number of overlapping dimers (AA, AG, AC, AT, GA, GG, GC, GT, CC, CG, CA, CT, TT, TA, TG, TC) in multiple sequences using Perl. I wrote the following code but it only works for one sequence. How can I extend it to multiple sequences?

#!/usr/bin/perl -w
open FH, "sample.txt";
$genome=<FH>;
%count=();
$count{substr($genome, $_, 2)}++ for 0..length($genome)-2;
print "AA: $count{AA}\n";
print "AG: $count{AG}\n";
print "AC: $count{AC}\n";
print "AT: $count{AT}\n";

print "TA: $count{TA}\n";
print "TG: $count{TG}\n";
print "TC: $count{TC}\n";
print "TT: $count{TT}\n";

print "GA: $count{GA}\n";
print "GG: $count{GG}\n";
print "GC: $count{GC}\n";
print "GT: $count{GT}\n";

print "CA: $count{CA}\n";
print "CG: $count{CG}\n";
print "CC: $count{CC}\n";
print "CT: $count{CT}\n";

I need:

  1. counts for each sequence and
  2. the total counts

input example: sample.txt

ATGGGCTCCTCCGCCATCACCGTGAGCTTCCTCCTCTTTCTGGCATTTCAGCTCCCAGGGCAAACAGGAGCAAATCCCGTGTATGGCTCTGTGTCCAATGCAGACCTGATGGATTTCAAGTAAAAG
ATGGTGAGAAAATGGGCCCTGCTCCTGCCCATGCTGCTCTGCGGCCTGACTGGTCCCGCACACCTCTTCCAGCCAAGCCTGGTGCTGGAGATGGCCCAGGTCCTCTTGGATAACTACTGCTTCCCAGAGAACCTGATGGGGATGCAGGGAGCCATCGAGCAGGCCATCAAAAGTCAGGAGATTCTGTCTATCTCAGACCCTCAGACTCTGGCCCATGTGTTGACAGCTGGGGTGCAGAGCTCCTTGAATGACCCTCGCCTGGTCATCTCCTATGAGCCCAGCACCCTCGAGGCCCCTCCGCGAGCTCCAGCAGTCACGAACCTCACACTAGAGGAAATCATCGCAGGGCTGCAGGATGGCCTTCGCCATGAGATTCTGGAAGGCAATGTGGGCTACCTGCGGGTGGACGACATCCCGGGCCAGGAGGTGATGAGCAAGCTGAGGAGCTTCCTGGTGGCCAACGTCTGGAGGAAGCTCGTGAACACGTCCGCCTTGGTGCTGGACCTCCGGCACTGCACTGGGGGACACGTGTCTGGCATCCCCTATGTCATCTCCTACCTGCACCCAGGGAGCACAGTCTCGCACGTGGACACCGTCTACGACCGCCCCTCCAACACAACCACTGAGATCTGGACCCTGCCTGAAGCCCTGGGAGAGAAGTACAGTGCAGACAAGGATGTGGTGGTCCTCACCAGCAGCCGCACGGGGGGCGTGGCTGAGGACATCGCTTACATCCTCAAACAGATGCGCAGGGCCATCGTGGTGGGCGAGCGGACTGTTGGGGGGGCTCTGAACCTCCAGAAGCTGAGGGTAGGCCAGTCCGACTTCTTTCTCACTGTGCCTGTGTCCAGATCCCTGGGGCCCCTGGGTGAGGGCAGCCAGACGTGGGAGGGCAGTGGGGTGCTGCCCTGTGTGGGGACACCGGCCGAGCAGGCCCTGGAGAAAGCCCTGGCCGTTCTCATGCTGCGCAGGGCCCTGCCAGGAGTCATTCAGCGCCTTCAGGAGGCGCTGCGCGAGTACTACACGCTGGTGGACCGTGTGCCCGCCCTGCTGAGCCACCTGGCCGCCATGGACCTGTCCTCGGTGGTCTCCGAGGACGATCTGGTCACTAAGCTCAATGCTGGCCTGCAGGCTGTGTCTGAGGACCCCAGGCTCCAGGTGCAGGTGGTCAGACCCAAAGAAGCCTCTTCTGGGCCTGAGGAAGAAGCTGAAGAACCTCCAGAGGCGGTCCCGGAAGTGCCCGAGGACGAGGCTGTTCGGCGGGCTCTGGTGGACTCCGTGTTCCAGGTTTCTGTGCTGCCGGGCAACGTGGGCTACCTGCGCTTCGACAGTTTCGCTGATGCCTCTGTCCTGGAGGTGCTGGGCCCCTACATCCTGCACCAGGTGTGGGAGCCCCTGCAGGACACGGAGCACCTCATCATGGACCTGCGGCAGAACCCCGGGGGGCCGTCCTCCGCGGTGCCCCTGCTGCTCTCCTACTTCCAGAGCCCTGACGCCAGCCCCGTGCGCCTCTTCTCCACCTACGACCGGCGCACCAACATCACACGCGAGCACTTCAGCCAGACGGAGCTGCTGGGCCGGCCCTACGGCACCCAGCGTGGCGTGTACCTGCTCACTAGCCACCGCACCGCCACCGCGGCCGAGGAGCTGGCCTTCCTCATGCAGTCACTGGGCTGGGCCACGCTGGTGGGCGAGATCACCGCGGGCAGCCTGCTGCACACACACACAGTATCCCTGCTGGAGACGCCCGAGGGCGGCCTGGCGCTCACGGTGCCTGTGCTCACCTTCATCGACAACCATGGCGAGTGCTGGCTGGGGGGCGGTGTGGTCCCCGATGCCATTGTGCTGGCCGAGGAAGCCCTAGACAGAGCCCAGGAGGTGCTGGAGTTCCACCGAAGCTTGGGGGAGTTGGTGGAAGGCACGGGGCGCCTGCTGGAGGCCCACTACGCTCGGCCAGAGGTCGTGGGGCAGATGGGTGCCCTGCTGCGAGCCAAGCTGGCCCAGGGGGCCTACCGCACCGCGGTGGACCTGGAGTCGCTGGCTTCCCAGCTTACGGCCGACCTGCAGGAGATGTCTGGGGACCACCGTCTGCTGGTGTTCCACAGCCCCGGCGAAATGGTGGCTGAGGAGGCGCCCCCACCGCCTCCCGTCGTCCCCTCCCCGGAGGAGCTGTCCTATCTCATCGAGGCCCTGTTCAAGACTGAGGTGCTGCCCGGCCAGCTGGGCTACCTGCGTTTCGACGCCATGGCTGAGCTGGAGACGGTGAAGGCCGTCGGGCCACAGCTGGTGCAGCTGGTGTGGCAGAAGCTGGTGGACACGGCCGCGCTGGTGGTCGACCTGCGCTACAACCCCGGCAGCTACTCCACAGCCGTGCCTCTACTCTGCTCCTACTTCTTCGAGGCAGAGCCCCGCCGGCACCTCTACTCTGTCTTTGACAGGGCCACGTCAAGGGTCACAGAGGTATGGACCCTGCCCCACGTTACAGGCCAGCGCTATGGCTCCCACAAGGACCTCTACGTTCTGGTGAGCCACACCAGCGGTTCAGCAGCTGAGGCTTTTGCTCACACCATGCAGGATCTGCAGCGAGCCACCATCATCGGGGAGCCCACGGCCGGAGGGGCACTCTCCGTGGGAATCTACCAGGTGGGCAGCAGCGCCTTATACGCCTCCATGCCCACGCAGATGGCCATGAGTGCCAGCACCGGCGAGGCCTGGGATCTGGCTGGGGTGGAGCCGGACATCACTGTGCCCATGAGCGTGGCCCTCTCCACAGCCCGGGACATAGTGACCCTGCGTGCCAAGGTGCCCACTGTGCTGCAGACAGCTGGGAAGCTCGTAGCGGATAACTACGCCTCCCCTGAGCTGGGAGTCAAGATGGCAGCCGAACTGAGTGGTCTGCAGAGCCGCTATGCCAGGGTGACCTCAGAAGCCGCCCTCGCCGAGCTGCTGCAAGCCGACCTGCAGGTGCTGTCCGGGGACCCACACCTGAAGACAGCTCATATACCTGAGGATGCCAAAGACCGCATTCCTGGCATTGTACCCATGCAGTAACAG
ATGGACATGATGGACGGCTGCCAGTTCTCGCCCTCTGAGTACTTCTACGACGGCTCCTGCATCCCATCCCCCGACGGTGAGTTCGGGGACGAGTTTGAGCCGCGAGTGGCTGCTTTCGGGGCTCACAAGGCAGACCTGCAAGGCTCAGACGAGGACGAGCACGTGCGAGCACCCACGGGCCACCACCAGGCCGGCCACTGCCTCATGTGGGCCTGCAAAGCATGCAAAAGGAAGTCCACCACCATGGATCGGCGGAAGGCGGCCACCATGCGCGAGCGGAGACGCCTGAAGAAGGTCAACCAGGCTTTCGACACGCTCAAGCGGTGCACCACGACCAACCCTAACCAGAGGCTGCCCAAGGTGGAGATCCTCAGGAATGCCATCCGCTACATTGAGAGTCTGCAGGAGCTGCTTAGGGAACAGGTGGAAAACTACTATAGCCTGCCGGGGCAGAGCTGCTCTGAGCCCACCAGCCCCACCTCAAGTTGCTCTGATGGCATGTAAATG

Solution

  • The key to this is take what you had so far as a functional unit of what you were trying to build. Your code worked for one sequence; extending it to multiple sequences is simply a matter of getting rid of the assumption that only one sequence is being considered:

    use strict;
    use warnings;
    use 5.010;
    
    use Data::Dumper ();
    
    sub count_dimers {
        my ($sequence) = @_;
    
        my %counts;
        $counts{substr($sequence, $_, 2)}++ for 0..length($sequence) - 2;
    
        my @dimers = qw(AA AG AC AT GA GG GC GT CC CG CA CT TT TA TG TC);
        %counts = map { $_ => $counts{$_} } @dimers;
    
        return %counts;
    }
    
    open(my $fh, '<', 'sample.txt');
    
    my @counts_by_sequence;
    while (my $sequence = <$fh>) {
        my %sequence_counts = count_dimers($sequence);
        push @counts_by_sequence, \%sequence_counts;
    }
    
    my %total_counts;
    for my $sequence_counts (@counts_by_sequence) {
        for my $dimer (keys %$sequence_counts) {
            $total_counts{$dimer} += ${ $sequence_counts}{$dimer};
        }
    }
    
    say Data::Dumper->Dump(
        [\%total_counts, \@counts_by_sequence],
        [qw(total_count counts_by_sequence)]
    );
    

    I left the output as an exercise to the reader, but the general shape of the transformation should be evident: What was once the entirety of the program is now a function that gets called once per sequence with the results per call stored and the results over all calls totaled.