perl · memory · substring · large-files

Count subsequences in hundreds of GB of data


I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.

To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb

Below, the code reads the whole file into a string, takes the first substring of length n (here I set n to 5, although I want to be able to change this), and counts its frequency:

abcde => 1

Next, it moves one character to the right and does the same:

bcdef => 1

It then continues for the rest of the string and prints the 5 most frequent sequences:

open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'

my $seq = <$in>; # read the whole file (a single line) into a string
my $len = length($seq);

my $seq_length = 5; # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
     my $kmer = substr($seq, $i, $seq_length);
     $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}

ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1

However, I would like to find a more efficient way of achieving this. If the input file were 10 GB or 1,000 GB, reading the whole thing into a string would be prohibitively expensive in memory.

I thought about reading in blocks of characters, say 100 at a time, and proceeding as above, but then sequences that span two blocks would not be tallied correctly.
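
To make the problem concrete, here is a small sketch of that approach, using the example string above and blocks of 10 characters with no overlap. It counts ebfeb only twice, because the third occurrence straddles a block boundary:

use strict;
use warnings;

my $str = 'abcdefabcgbacbdebdbbcaebfebfebfeb';
my $k   = 5;

# Tally k-mers within each fixed block; nothing spans two blocks
my %naive;
for my $block ( $str =~ /.{1,10}/g ) {
    $naive{ substr($block, $_, $k) }++ for 0 .. length($block) - $k;
}

# Prints "ebfeb => 2" although the true count is 3
print "ebfeb => ", $naive{ebfeb} // 0, "\n";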

My idea, then, is to read only n characters of the file at a time, then move on to the next n characters and do the same, tallying their frequency in a hash as above.

  • Are there any suggestions about how I could do this? I've had a look at read using an offset, but can't get my head around how I could incorporate it here (a sketch of what I have so far follows this list)
  • Is substr the most memory efficient tool for this task?
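
For reference, this is as far as I got with read and an offset. The fourth argument to read tells it where in the buffer to place the new bytes, so passing the current length appends rather than overwrites:

use strict;
use warnings;

open my $in, '<', 'in.txt' or die $!;

my $buf = '';

# OFFSET = length($buf) makes read append to the buffer
while ( my $got = read $in, $buf, 100, length $buf ) {
    # ... tally k-mers in $buf here, but sequences spanning
    # two reads are still missed
}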

Solution

  • From your own code it looks like your data file is just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if the line has one newline character at the end, the final selection of the five most frequent subsequences will drop any sequence containing it, as each occurs only once.

    This program uses sysread to fetch an arbitrarily sized chunk of data from the file and append it to the data we already have in memory.

    The body of the loop is mostly similar to your own code, but I have used the list form of for instead of the C-style one, as it is much clearer.

    After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file.

    I've also used constants for the k-mer size and the chunk size. They are constant, after all!

    The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys; if you require a specific order for sequences with equal counts then you must specify it so that I can change the sort (a deterministic tie-break is sketched after the output below).

    use strict;
    use warnings 'all';
    
    use constant SEQ_LENGTH => 5;           # K-mer length
    use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB
    
    my $in_file = shift // 'in.txt';
    
    open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};
    
    my %data;
    my $chunk;
    my $length = 0;
    
    # Append up to CHUNK_SIZE new bytes at offset $length, after the
    # tail retained from the previous chunk
    while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {
    
        $length += $size;
    
        for my $offset ( 0 .. $length - SEQ_LENGTH ) {
             my $kmer = substr $chunk, $offset, SEQ_LENGTH;
             ++$data{$kmer};
        }
    
        # Keep only the last SEQ_LENGTH-1 bytes: k-mers starting in them
        # span into the next chunk and have not been counted yet
        $chunk = substr $chunk, -(SEQ_LENGTH-1);
        $length = length $chunk;
    }
    
    my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
    print "$_ $data{$_}\n" for @kmers[0..4];
    

    Output

    ebfeb 3
    febfe 2
    bfebf 2
    gbacb 1
    acbde 1
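
    Should you need a deterministic order for sequences with equal counts, a secondary comparison in the sort pins it down; for example, to break ties alphabetically:

    my @kmers = sort {
        $data{$b} <=> $data{$a}   # most frequent first
            or
        $a cmp $b                 # alphabetical among equal counts
    } keys %data;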
    

    Note the line $chunk = substr $chunk, -(SEQ_LENGTH-1); which resets $chunk at the end of each pass through the while loop. It ensures that sequences spanning two chunks are counted correctly.

    With SEQ_LENGTH set to 5, this is equivalent to $chunk = substr $chunk, -4: it removes all but the last four characters of the current chunk, so the next read appends CHUNK_SIZE bytes from the file to those remaining characters. The search then continues with the last four characters of the previous chunk prepended to the new data, so no sequence falls into a "crack" between chunks.
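
    One way to convince yourself of the boundary handling is to compare the chunked tally with a single whole-string scan over a small input; the two must agree for any chunk size. A self-contained check, using the example string from your question and the same overlap technique:

    use strict;
    use warnings 'all';

    use constant SEQ_LENGTH => 5;
    use constant CHUNK_SIZE => 7;

    my $seq = 'abcdefabcgbacbdebdbbcaebfebfebfeb';

    # Reference tally: one pass over the whole string
    my %whole;
    $whole{ substr($seq, $_, SEQ_LENGTH) }++
            for 0 .. length($seq) - SEQ_LENGTH;

    # Chunked tally, keeping the last SEQ_LENGTH-1 bytes between chunks
    my %chunked;
    my $buf = '';
    for ( my $pos = 0; $pos < length $seq; $pos += CHUNK_SIZE ) {
        $buf .= substr $seq, $pos, CHUNK_SIZE;   # simulates one sysread
        $chunked{ substr($buf, $_, SEQ_LENGTH) }++
                for 0 .. length($buf) - SEQ_LENGTH;
        $buf = substr $buf, -(SEQ_LENGTH-1);     # retain the overlap
    }

    # Every k-mer must have the same count in both tallies
    for my $kmer ( keys %whole ) {
        die "mismatch for $kmer" unless $whole{$kmer} == ($chunked{$kmer} // 0);
    }
    print "chunked tally matches whole-string tally\n";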