I'm pretty new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a FASTQ file. A FASTQ file stores its records in blocks of four lines each, and the first line in each block starts with the character "@". The FASTQ file I use as my input is 36 GB and contains about 14,000,000 lines.
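For reference, a single four-line record looks like this (a made-up read, just to show the structure: identifier, sequence, separator, and one quality character per base):

@read_001
GATCTGGAAGTTCTCTGACA
+
IIIIHHHGGGFFFEEDDCC!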
I tried to rewrite an already existing script that used way too much memory, and I managed to reduce the memory usage a lot. But the script takes forever to run, and I don't see why.
import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type=str, help="The name of the fastq input file.", default=sys.stdin)
parser.add_argument("outputfile", type=str, help="Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()

def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file
    # (note: this counts every line containing "@", and "@" is also a
    # valid quality-score character, so the count can be too high):
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0, int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(4 * i)
        linesamples.append(4 * i + 1)
        linesamples.append(4 * i + 2)
        linesamples.append(4 * i + 3)
    # fetch lines from the input file and write them to the output file:
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)
The grep step takes practically no time at all, but after more than 500 minutes the script still hasn't started writing to the output file. So I suppose it is one of the steps between grep and the last for loop that takes so long, but I don't understand which step exactly, or what I can do to speed it up.
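To narrow that down, one option is to time each phase with a small helper (a rough sketch; timed and the label strings are just illustrative names):

import time

def timed(label, func, *fargs):
    # call func(*fargs), report the elapsed wall-clock time, and return the result
    t0 = time.time()
    result = func(*fargs)
    print("%s took %.1f s" % (label, time.time() - t0))
    return result

# e.g. inside sample():
# seqs = timed("grep", subprocess.check_output, ["grep", "-c", "@", str(args.infile)])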
Depending on the size of linesamples, the check "if i in linesamples" will take a long time, since you are searching through a list once for every iteration through infile. Membership testing on a list is a linear scan, so each of your ~14,000,000 lines is compared against up to 4*n stored line numbers. You could convert linesamples into a set to get constant-time lookups instead. Also, I have replaced enumerate with a line_num counter that we increment in each iteration (although enumerate itself is cheap; the set is the change that matters).
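To see the difference, here is a quick timing experiment that looks up a value which is not present (the worst case for the list, since the scan has to walk all 100,000 elements); the sizes and repeat counts are arbitrary:

import timeit

setup = "as_list = list(range(100000)); as_set = set(as_list)"
# 1000 failed lookups against the list: a full scan every time
print(timeit.timeit("99999999 in as_list", setup=setup, number=1000))
# 1000 failed lookups against the set: a single hash probe every time
print(timeit.timeit("99999999 in as_set", setup=setup, number=1000))

On typical hardware the list version takes on the order of a second, while the set version finishes in well under a millisecond.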
def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0, int(seqs)), int(args.n))
    # build a set of the line numbers that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(4 * i)
        linesamples.add(4 * i + 1)
        linesamples.add(4 * i + 2)
        linesamples.add(4 * i + 3)
    # fetch lines from the input file and write them to the output file:
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    infile.close()
    outputfile.close()
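With only that change, the script behaves the same but spends constant time on each membership test instead of scanning a list of 4*n line numbers once per input line. Assuming the script ends with a call to sample(), it can then be run as before, for example (the file names here are hypothetical):

python sample_fastq.py reads.fastq subset.fastq -n 1000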