
Alternatives to Python Popen.communicate() memory limitations?


I have the following chunk of Python code (running v2.7) that results in MemoryError exceptions being thrown when I work with large (several GB) files:

from subprocess import Popen, PIPE
import sys

myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
myStdout, myStderr = myProcess.communicate()
sys.stdout.write(myStdout)
if myStderr:
    sys.stderr.write(myStderr)

In reading the documentation for Popen.communicate(), there appears to be some buffering going on:

Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

Is there a way to disable this buffering, or force the cache to be cleared periodically while the process runs?

What alternative approach should I use in Python for running a command that streams gigabytes of data to stdout?

I should note that I need to handle output and error streams.


Solution

  • I think I found a solution:

    from subprocess import Popen, PIPE
    import sys

    myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
    # Stream the child's output a line at a time instead of
    # buffering all of it in memory with communicate()
    for ln in myProcess.stdout:
        sys.stdout.write(ln)
    for ln in myProcess.stderr:
        sys.stderr.write(ln)
    

    This seems to get my memory usage down enough to get through the task.

    Update

    I have recently found a more flexible way of handling data streams in Python, using threads. It's interesting that Python is so poor at something that shell scripts can do so easily!
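
    A minimal sketch of that threaded approach follows, assuming myCmd is the same command string as above (the pump helper name is my own). Reading each pipe on its own thread also sidesteps a problem the sequential loops above can run into: if the child fills its stderr pipe while we are still draining stdout, both processes block.

    from subprocess import Popen, PIPE
    from threading import Thread
    import sys

    def pump(src, dst):
        # Copy one pipe to one output stream, a line at a time, so at
        # most a single line per stream is held in memory. iter() with
        # readline avoids Python 2's readahead buffering on pipes.
        for line in iter(src.readline, ''):
            dst.write(line)
        src.close()

    myProcess = Popen(myCmd, shell=True, stdout=PIPE, stderr=PIPE)
    tOut = Thread(target=pump, args=(myProcess.stdout, sys.stdout))
    tErr = Thread(target=pump, args=(myProcess.stderr, sys.stderr))
    tOut.start()
    tErr.start()
    tOut.join()
    tErr.join()
    myProcess.wait()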