Fastest way to read a fastq with scikit-bio

I am trying to read a fastq formatted text file using scikit-bio.

Given that it is a fairly large file, performing operations is quite slow.

Ultimately, I am attempting to dereplicate the fastq file into a dictionary:

f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')

seq_dic = {}
for seq in seqs:
    seq = str(seq)
    if seq in seq_dic.keys():
        seq_dic[seq] +=1
    else:
        seq_dic[seq] = 1

Most of the time here is used during the reading of the file:

%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq')

for seq in itertools.islice(seqs, 100000):
    seq

CPU times: user 46.2 s, sys: 334 ms, total: 46.5 s
Wall time: 47.8 s

My understanding is that not verifying the sequences would improve run time, however that does not appear to be the case:

%%time
f = 'Undetermined_S0_L001_I1_001.fastq'
seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8')

for seq in itertools.islice(seqs, 100000):
    seq

CPU times: user 47 s, sys: 369 ms, total: 47.4 s
Wall time: 48.9 s

So my question is, first why isn't verify=False improving run time and second is there a faster way using scikit-bio to read sequences?

Solution

first why isn't verify=False improving run time

verify=False is a parameter accepted by scikit-bio's I/O API generally. It is not specific to a particular file format. verify=False tells scikit-bio to not invoke the file format's sniffer to double-check that the file is in the format specified by the user. From the docs [1]:

verify : bool, optional
    When True, will double check the format if provided.

So verify=False doesn't turn off sequence data validation; it turns off file format sniffer verification. You will have minimal performance gains with verify=False.

seqs = skbio.io.read(f, format='fastq', verify=False, variant='illumina1.8') will produce a generator of skbio.Sequence objects. Sequence alphabet validation is not performed because skbio.Sequence does not have an alphabet, so that isn't where your performance bottleneck is. Note that if you want to read your FASTQ file into a specific type of biological sequence (DNA, RNA, or Protein), you can pass constructor=skbio.DNA (for example). This will perform alphabet validation for the relevant sequence type, and that currently cannot be toggled off when reading. Since you're having performance issues, I don't recommend passing constructor as that will only slow things down even more.

and second is there a faster way using scikit-bio to read sequences?

There isn't a faster way to read FASTQ files with scikit-bio. There is an issue [2] exploring ideas that could speed this up, but those ideas haven't been implemented.

scikit-bio is slow at reading FASTQ files because it supports reading sequence data and quality scores that may span multiple lines. This complicates the reading logic and has a performance hit. FASTQ files with multi-line data are not common anymore however; Illumina used to produce these files but they now prefer/recommend writing FASTQ records that are exactly four lines (sequence header, sequence data, quality header, quality scores). In fact, this is how scikit-bio writes FASTQ data. With this simpler record format, it is much faster and easier to read a FASTQ file. scikit-bio is also slow at reading FASTQ files because it decodes and validates the quality scores. It also stores sequence data and quality scores in a skbio.Sequence object, which has performance overhead.

In your case, you don't need the quality scores decoded, and you likely have a FASTQ file with simple four-line records. Here's a Python 3 compatible generator that reads a FASTQ file and yields sequence data as Python strings:

import itertools

def read_fastq_seqs(filepath):
    with open(filepath, 'r') as fh:
        for seq_header, seq, qual_header, qual in itertools.zip_longest(*[fh] * 4):
            if any(line is None for line in (seq_header, seq, qual_header, qual)):
                raise Exception(
                    "Number of lines in FASTQ file must be multiple of four "
                    "(i.e., each record must be exactly four lines long).")
            if not seq_header.startswith('@'):
                raise Exception("Invalid FASTQ sequence header: %r" % seq_header)
            if qual_header != '+\n':
                raise Exception("Invalid FASTQ quality header: %r" % qual_header)
            if qual == '\n':
                raise Exception("FASTQ record is missing quality scores.")

            yield seq.rstrip('\n')

You can remove the validation checks here if you're certain your file is a valid FASTQ file containing records that are exactly four lines long.

This isn't related to your question, but I wanted to note that you may have a bug in your counter logic. When you see a sequence for the first time, your counter is set to zero instead of 1. I think the logic should be:

if seq in seq_dic:  # calling .keys() is necessary
    seq_dic[seq] +=1
else:
    seq_dic[seq] = 1

[1] http://scikit-bio.org/docs/latest/generated/skbio.io.registry.read.html

[2] https://github.com/biocore/scikit-bio/issues/907