Tags: bash, python-2.7, subprocess, pipe, deadlock

Persistent subprocess pipeline - read stdout without closing stdin


I have a processing chain that goes along these lines:

  1. Preprocess the data in a few steps, which include calling out to Perl, Bash and Python scripts from a single Bash script, connected via pipes
  2. Transform the data in Python (the program I use unfortunately doesn't run on Python 3, so I think I'm stuck with 2.7)
  3. Postprocess the data just as in the preprocessing step

One way this has worked before is

cat input | preprocess.sh | transform.py | postprocess.sh

This works well for processing batches of input data.

However, I now need to implement this as server functionality in Python: I have to be able to accept a single data item, run it through the pipeline and return the result quickly.

The central step I just call from within Python, so that's the easy part. Postprocessing is also relatively easy.

Here's the issue: the preprocessing code consists of 4 different scripts, each piping its output to the next, and two of them need to load model files from disk to work. That loading is relatively slow and does horrible things to my execution time. I therefore think I need to keep those processes alive, write to their stdins and read from their stdouts.

However, I find that for every single link in the chain I can't write to stdin and then read stdout without first closing stdin, which would render the approach useless: I would have to restart the process and load the model again.
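
For a single link, the setup looks roughly like this (Python 2.7; the script name here is just a placeholder):

import subprocess

# Keep one link of the chain alive and feed it a single record.
proc = subprocess.Popen(
    ["./preprocessing_script_1.sh"],   # placeholder script name
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    bufsize=1,                         # line buffering on our side of the pipe
)

proc.stdin.write("one record of input data\n")
proc.stdin.flush()

# This is where things go wrong: if the child block-buffers its stdout
# (as many tools do when not writing to a terminal), readline() blocks
# until the child flushes or stdin is closed.
line = proc.stdout.readline()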

Do note that this is not a problem with my scripts, as for each link in the chain

cat input_data | preprocessing_script_i.sh

returns just what it should within Bash.

Here are the things I have tried up until now:

  • simply writing to stdin and flushing it - waits indefinitely on readline
  • process.communicate() - kills the process and is thus out of the question
  • using master and slave pty handles - hangs on readline
  • using a queue and a thread to read stdout while writing to stdin from the main thread (a sketch of this attempt follows the list)
  • messing around with bufsize in the call to subprocess.Popen
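
The queue-and-thread attempt looked roughly like this (again, the script name is a placeholder):

import subprocess
import threading
from Queue import Queue, Empty   # the module is named "queue" on Python 3

def drain(pipe, q):
    # Read the child's stdout line by line and hand the lines to the main thread.
    for line in iter(pipe.readline, ""):
        q.put(line)
    pipe.close()

proc = subprocess.Popen(
    ["./preprocessing_script_1.sh"],   # placeholder script name
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    bufsize=1,
)

out_q = Queue()
reader = threading.Thread(target=drain, args=(proc.stdout, out_q))
reader.daemon = True
reader.start()

proc.stdin.write("one record of input data\n")
proc.stdin.flush()

try:
    result = out_q.get(timeout=5)      # still times out if the child never flushes
except Empty:
    result = None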

Is there some way to do this from Python? Is it even possible at all? I'm starting to doubt it. Could reimplementing this pipeline in another language (without touching its elements, as that's not feasible for my use case) solve this for me?


Solution

  • I'm sorry - the ideas proposed were great and this probably won't help many people in the future, but here is how I solved the problem.

    It turns out perl has a -b flag for printing in line-buffered mode. Once I plugged that into the perl -b script.perl part of the processing pipeline, things started moving smoothly, and a simple write() to the process's stdin followed by a flush() was enough to get the output (see the sketch after this answer).

    I will try to change the question tags and title to better fit the actual problem.
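
For reference, a minimal sketch of the resulting setup (Python 2.7), assuming the perl step is the one that was buffering; script.perl is a placeholder name and the -b flag is the line-buffering switch described in the answer above:

import subprocess

# Start the buffering step once so its model stays loaded between requests.
proc = subprocess.Popen(
    ["perl", "-b", "script.perl"],     # placeholder script; -b as described above
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    bufsize=1,
)

def transform(record):
    # One line in, one line out, with the process (and its model) kept alive.
    proc.stdin.write(record.rstrip("\n") + "\n")
    proc.stdin.flush()
    return proc.stdout.readline()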