I want to count the occurrences of around 500 patterns in a large .fastq file (59 million lines). The patterns are all exactly 20 characters long.
In Unix this would be a simple:
grep -F -o -f patterns.txt big_file.fastq | sort | uniq -c
However, I wish to avoid writing a temporary patterns file so I created a pipe using python's subprocess library:
from subprocess import Popen, PIPE, STDOUT
p1 = Popen(["grep", "-F", "-o", "-f", "-", "big_file.fastq"], shell = False, stdin = PIPE, stdout = PIPE, stderr= STDOUT)
p2 = Popen(["sort"], shell = False, stdin = p1.stdout, stdout = PIPE, stderr = STDOUT)
p3 = Popen(["uniq", "-c"], shell = False, stdin = p2.stdout, stdout = PIPE, stderr = STDOUT)
I then call communicate() on this, providing the encoded contents of an io.StringIO file-like object as input (which grep reads as its pattern file via the '-' passed to -f):
import io
patterns_file = io.StringIO("\n".join(patterns_list))
p3.communicate(input = patterns_file.read().encode('utf-8'))[0]
Calling communicate() on p3 (the uniq end of the pipe) like this works fine.
However, while testing I erroneously called it on the first part of the pipe:
p1.communicate(input = patterns_file.read().encode('utf-8'))[0]
This gave me completely wrong outputs, including matches that were shorter or longer than the expected 20 characters.
I don't understand why this happens. Wouldn't calling communicate() on p1 only involve that part of the pipe and ignore the rest? Removing p2 and p3 caused p1 to grep correctly. I feel I am missing something about how Popen works.
Any help is appreciated.
When you instantiate Popen objects, the subprocesses they refer to are immediately started. Thus, even if you only call communicate() on p1, p2 and p3 are also running.
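A quick way to see that the children are live before any communicate() call is this stand-alone sketch (the sleep command here is just a hypothetical stand-in):

from subprocess import Popen

# The child is running the moment Popen() returns, long before communicate().
p = Popen(["sleep", "5"])
print(p.pid)     # the child already has a PID
print(p.poll())  # prints None: the process is still running
p.terminate()
p.wait()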
Why does this matter? Because p2 still has its stdin attached to the FIFO to which p1 is writing its output!
If the sort operation in p2 is still reading content at the same time you ask your Python program to read that same content directly, you end up with p1's output divided between them. Hilarity can be expected to ensue: the only way reads being split between two programs wouldn't result in visibly corrupt data is if both p2 and communicate() were reading blocks that were exact multiples of the 21-byte record length (the 20-character match plus its newline), while still small enough for the operating system not to split them into multiple syscalls. Typical chunk sizes used for unbuffered reads, however, are multiples of 4096, which is not a multiple of 21, so the chunk boundaries almost never fall on record boundaries -- and wherever consecutive chunks end up in different readers, a record is torn, producing the too-short and too-long matches you saw.
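To make the alignment problem concrete, here is a hypothetical sketch using a raw os.pipe(); the data and read sizes are invented, but they show how chunk boundaries that are not multiples of the 21-byte record length tear records apart, and with two competing readers the torn pieces additionally land in different consumers:

import os

# Two well-formed 21-byte records (20 characters plus newline) in a pipe.
r, w = os.pipe()
os.write(w, b"A" * 20 + b"\n" + b"B" * 20 + b"\n")
os.close(w)

first = os.read(r, 30)     # 30 is not a multiple of 21 ...
second = os.read(r, 4096)  # ... so this read starts mid-record
os.close(r)

print(first)   # one whole record plus a torn 9-byte front of the next
print(second)  # the remaining 11 bytes of the torn record plus its newline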
By the way -- for many programs this wouldn't have as severe an effect, because FIFOs have a relatively small buffer; a program that writes a line of output for each line of input it reads would end up blocked on output very quickly, and thus would stop reading further input until its output is at least partially flushed. sort is an exception, because it needs to read all its input before it knows what its first line of output will be!
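For completeness, here is a minimal sketch of one way to drive the original pipeline so that every pipe has exactly one reader, following the shell-pipeline pattern from the subprocess documentation (patterns_list is the list from the question; stderr is deliberately left alone here so error messages are not mixed into the data being counted):

from subprocess import Popen, PIPE

patterns = ("\n".join(patterns_list) + "\n").encode("utf-8")

p1 = Popen(["grep", "-F", "-o", "-f", "-", "big_file.fastq"], stdin=PIPE, stdout=PIPE)
p2 = Popen(["sort"], stdin=p1.stdout, stdout=PIPE)
p3 = Popen(["uniq", "-c"], stdin=p2.stdout, stdout=PIPE)

# Close the parent's copies of the intermediate pipe ends so that only the
# child processes hold them and EOF/SIGPIPE propagate correctly.
p1.stdout.close()
p2.stdout.close()

# ~500 patterns of 21 bytes each fit comfortably inside a pipe buffer, so a
# single blocking write is safe here; for much larger inputs you would feed
# p1.stdin from a separate thread to avoid a deadlock.
p1.stdin.write(patterns)
p1.stdin.close()

output = p3.communicate()[0]  # the sorted, counted matches from uniq -c

The parent never reads from p1.stdout or p2.stdout itself, so communicate() is no longer competing with sort for grep's output.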