I have a processing chain that goes along these lines:
One way this has worked before is
```shell
cat input | preprocess.sh | transform.py | postprocess.sh
```
And this works well with processing batches of input data.
However, I now find myself needing to implement this as server functionality in Python: I need to accept a single data item, run it through the pipeline, and return the result quickly.
The central step I just call from within Python, so that's the easy part. Postprocessing is also relatively easy.
Here's the issue: the preprocessing code consists of 4 different scripts, each piping its output to the next, and two of them need to load model files from disk to work. That loading is relatively slow and does horrible things to my execution time. I therefore think I need to keep them in memory somehow, write to their `stdin`s and read the output.
However, I find that for every single link in my chain, I can't write to `stdin` and read `stdout` without closing `stdin`, and that would render the method useless, as I would then have to reopen the process and load the model again.
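For illustration, here is a minimal sketch of why the one-shot approach doesn't fit. The echo child below is a hypothetical stand-in for one of my preprocessing scripts:

```python
import subprocess
import sys

# communicate() does give one clean round trip, but it closes stdin and
# waits for the child to exit, so the process (and any model it loaded)
# is gone after a single item - exactly what I need to avoid.
p = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
result, _ = p.communicate("one item\n")  # -> "one item\n", child has exited
```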
Do note that this is not a problem with my scripts, as for each link in the chain
```shell
cat input_data | preprocessing_script_i.sh
```
returns just what it should within Bash.
Here are the things I have tried up until now:
- Writing to `stdin` and `flush`ing it - waits indefinitely on `readline`.
- `process.communicate` - kills the process and is thus out of the question.
- `pty` handles - hangs on `readline`.
- Reading `stdout` in a separate thread while writing to `stdin` from the main thread.
- Changing `bufsize` in the call to `subprocess`.
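To be concrete about the reader-thread variant, here is roughly what I tried, sketched under the assumption that the child flushes its output per line (the upper-casing child is a hypothetical stand-in for one of the preprocessing scripts; `-u` forces unbuffered I/O on its side):

```python
import queue
import subprocess
import sys
import threading

# Long-lived child process standing in for one preprocessing stage.
child = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    sys.stdout.write(line.upper())\n"
     "    sys.stdout.flush()\n"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1)

out_q = queue.Queue()

def _drain(pipe, q):
    # Read the child's stdout on its own thread so the main thread
    # never blocks on a read while it still has data to write.
    for line in pipe:
        q.put(line)

threading.Thread(target=_drain, args=(child.stdout, out_q), daemon=True).start()

child.stdin.write("hello\n")
child.stdin.flush()
result = out_q.get(timeout=5)  # -> "HELLO\n"

child.stdin.close()
child.wait()
```

In my case this hung not because the pattern is wrong, but because the real stages buffered their output internally.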
Is there some way to do this from Python? Is this even possible at all? I'm starting to doubt it. Could reimplementing this pipeline in another language solve this for me (without touching the elements themselves, as that's not feasible for my use case)?
I'm sorry - the ideas proposed were great, and this probably won't help many people in the future, but this is how I solved the problem.
It turns out `perl` has a `-b` flag for printing in line-buffered mode. Once I plugged that into the `perl -b script.perl` part of the processing pipeline, things started moving smoothly, and a simple `process.write()` followed by `.flush()` was enough to get the output.
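Once every stage flushes per line, the plain write/flush/readline round trip is all that's needed - no threads, no ptys. A minimal sketch, where the child is a hypothetical stand-in for a line-buffered stage (the role the `perl` script plays in my actual pipeline):

```python
import subprocess
import sys

# Long-lived stage; "-u" makes this stand-in child unbuffered, mirroring
# what line-buffered mode did for the real perl stage.
stage = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    sys.stdout.write('out:' + line)\n"
     "    sys.stdout.flush()\n"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1)

def run_one(item: str) -> str:
    stage.stdin.write(item + "\n")
    stage.stdin.flush()             # push the line through the pipe now
    return stage.stdout.readline()  # safe: the stage emits one line per input line

result = run_one("data")  # -> "out:data\n"
stage.stdin.close()
stage.wait()
```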
I will try to change the question tags and title to better fit the actual problem.