Search code examples
pythonperformancebioinformaticsfastq

How can I make my Python script faster?


I'm pretty new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq-file. A fastq-file stores information in blocks of four rows each. The first row in each block starts with the character "@". The fastq file I use as my input file is 36 GB, containing about 14,000,000 lines.

I tried to rewrite an already existing script that used way too much memory, and I managed to reduce the memory usage a lot. But the script takes forever to run, and I don't see why.

parser = argparse.ArgumentParser()
parser.add_argument("infile", type = str, help = "The name of the fastq input file.", default = sys.stdin)
parser.add_argument("outputfile", type = str, help = "Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()


def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)

The grep-step takes practically no time at all, but after over 500 minutes, the script still hasn't started to write to the output file. So I suppose it is one of the steps in between grep and the last for-loop that takes such a long time. But I don't understand which step exactly, and what I can do to speed it up.


Solution

  • Depending on the size of linesamples, the if i in linesamples will take a long time since you are searching through a list for each iteration through infile. You could convert this into a set to improve the lookup time. Also, enumerate is not very efficient - I have replaced that with a line_num construct which we increment in each iteration.

    def sample():
        linesamples = set()
        infile = open(args.infile, 'r')
        outputfile = open(args.outputfile, 'w')
        # count the number of fastq "chunks" in the input file:
        seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
        # randomly select n fastq "chunks":
        seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
        for i in seqsamples:
            linesamples.add(int(4*i+0))
            linesamples.add(int(4*i+1))
            linesamples.add(int(4*i+2))
            linesamples.add(int(4*i+3))
        # make a list of the lines that are to be fetched from the fastq file:
        # fetch lines from input file and write them to output file.
        line_num = 0
        for line in infile:
            if line_num in linesamples:
                outputfile.write(line)
            line_num += 1
        outputfile.close()