Tags: python, subprocess, generator, popen

Using a generator as subprocess input raises an "I/O operation on closed file" exception


I have a large file that needs to be processed before it is fed to another command. I could save the processed data to a temporary file, but I would like to avoid that. I wrote a generator that processes one line at a time, and the following script feeds its output to the external command as input. However, I get an "I/O operation on closed file" exception on the second iteration of the loop:

cmd = ['intersectBed', '-a', 'stdin', '-b', bedfile]
p = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for entry in my_entry_generator: # <- this is my generator
    output = p.communicate(input='\t'.join(entry) + '\n')[0]
    print output

I read a similar question that uses p.stdin.write instead, but I still had the same problem.

What did I do wrong?

[edit] I replaced the last two statements with the following (thanks SpliFF):

    output = p.communicate(input='\t'.join(entry) + '\n')
    if output[1]: print "error:", output[1]
    else: print output[0]

to see whether the external program reported any error. It did not; I still get the same exception at the p.communicate line.


Solution

  • The communicate method of a subprocess.Popen object can only be called once per process. It sends the input you give it to the process and reads all of its stdout and stderr output. "All" means that it waits for the process to exit, so that it knows there is no more output. By the time communicate returns, the process has exited and its pipes are closed, which is why the second call in your loop fails with "I/O operation on closed file".

    If you want to use communicate, you have to either restart the process in the loop, or give it a single string that contains all the input from the generator. If you want streaming communication, sending the data bit by bit, you cannot use communicate. Instead, you need to write to p.stdin while reading from p.stdout and p.stderr. This is tricky, because you can't tell which output was caused by which input, and because you can easily deadlock: if you block on a write while the process is blocked waiting for you to read its output, neither side makes progress. There are third-party libraries that can help with this, like Twisted.
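
    As a minimal sketch of the single-string option: the function below joins everything the generator yields and passes it to one communicate call. The name run_with_all_input is hypothetical, and universal_newlines=True is an assumption added so that the same code handles text on both Python 2.7 and Python 3. Because the whole input is built in memory first, this only suits data that fits in RAM.

```python
import subprocess


def run_with_all_input(cmd, entries):
    """Send every entry to cmd through a single communicate() call.

    cmd     -- argument list for the external command
    entries -- iterable of sequences of strings (one sequence per line)
    """
    # Build the complete input up front: tab-separated fields, one line per entry.
    data = ''.join('\t'.join(entry) + '\n' for entry in entries)

    p = subprocess.Popen(cmd,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         universal_newlines=True)  # text mode (assumption)

    # communicate() writes the data, closes stdin, and reads until the process exits.
    out, err = p.communicate(input=data)
    return p.returncode, out, err
```

    With the names from the question, it would be called as run_with_all_input(['intersectBed', '-a', 'stdin', '-b', bedfile], my_entry_generator).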

    If you want to do this interactively, sending some data and then waiting for and processing the result before sending more data, things get even harder. You should probably use a third-party library like pexpect for that.

    Of course, if you can get away with just starting the process inside the loop, that would be a lot easier:

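    import subprocess

    cmd = ['intersectBed', '-a', 'stdin', '-b', bedfile]
    for entry in my_entry_generator:
        # Start a fresh process for each entry: communicate() works only once per process.
        p = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, universal_newlines=True)
        output = p.communicate(input='\t'.join(entry) + '\n')[0]
        print(output)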
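
    If starting a new process for every entry is too slow, the streaming approach described earlier (writing to p.stdin while reading p.stdout) can be sketched with a background writer thread. This is only an illustration under some assumptions: stream_through is a hypothetical name, stderr is left unpiped so that it cannot fill up unread, and universal_newlines=True is used for text mode. It also assumes the command emits output line by line; there is still no way to tell which output line belongs to which input line.

```python
import subprocess
import threading


def stream_through(cmd, entries):
    """Feed entries to cmd one line at a time and yield its output lines."""
    p = subprocess.Popen(cmd,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         universal_newlines=True)  # text mode (assumption)

    def feed():
        # Runs in a separate thread, so a blocked write never stops the reader below.
        try:
            for entry in entries:
                p.stdin.write('\t'.join(entry) + '\n')
        except (IOError, OSError):
            pass  # the child exited early and closed its end of the pipe
        finally:
            try:
                p.stdin.close()  # end of input for the child
            except (IOError, OSError):
                pass

    writer = threading.Thread(target=feed)
    writer.daemon = True
    writer.start()

    # Read output as it arrives; the loop ends when the child closes stdout.
    for line in p.stdout:
        yield line.rstrip('\n')

    writer.join()
    p.stdout.close()
    p.wait()
```

    Usage would look like: for line in stream_through(cmd, my_entry_generator): print(line). The generator is consumed lazily by the writer thread, so the full input never has to sit in memory at once.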