I'm trying to use GNU Parallel to help me process some remote files that I don't want to save locally.
My command looks something like this:
python list_files.py | \
parallel -j5 'aws s3 cp s3://s3-bucket/{} -' | \
parallel -j5 --round --pipe -l 5000 "python process_and_print.py"
process_and_print.py prints output for some input lines, but that output doesn't reach stdout immediately as I expected; instead, I only see it after the process has finished. If I remove the --round parameter, it all works as expected.
Where does all that data get saved? Do I have a way to print it to stdout, line by line, without buffering?
Where does all that data get saved?
All buffered output from GNU Parallel is buffered in temporary files in $TMPDIR / --tmpdir, which defaults to /tmp. You cannot see the files, because they are removed immediately (but kept open), so that you do not have to clean up if GNU Parallel is killed.
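The "removed but kept open" trick can be sketched in a few lines of Python. This is only an illustration of the general POSIX behavior, not GNU Parallel's actual code; the function name buffer_in_unlinked_file is made up for this example.

```python
import os
import tempfile


def buffer_in_unlinked_file(data: bytes, tmpdir=None) -> bytes:
    """Write data to a temp file whose name is removed at once, then read it back.

    Hypothetical helper for illustration only (POSIX semantics assumed).
    """
    # dir=None lets tempfile honor $TMPDIR before falling back to a system default
    fd, path = tempfile.mkstemp(dir=tmpdir)
    with os.fdopen(fd, "w+b") as f:
        # Remove the directory entry; the open descriptor keeps the data alive,
        # so nothing is left behind even if the process is killed.
        os.unlink(path)
        f.write(data)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(buffer_in_unlinked_file(b"line 1\nline 2\n"))
```

Once the descriptor is closed (or the process dies), the kernel frees the space, which is why no cleanup is needed and why the files never show up in a directory listing.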
Do I have a way to print it to stdout, line by line,
--line-buffer
without buffering?
-u disables buffering altogether, but then you cannot guarantee line-by-line output.
--line-buffer buffers a full line in memory from version 20170822 and thus does not buffer in /tmp.
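One thing to check on the script side: GNU Parallel can only pass a line along once the job has actually written it, and Python normally block-buffers stdout when it is connected to a pipe. The contents of process_and_print.py are not shown in the question, so the following is a minimal sketch under that assumption; emit_lines and process are hypothetical names.

```python
import sys


def process(line: str) -> str:
    # Hypothetical placeholder for the real per-line work
    return line.upper()


def emit_lines(lines, out) -> None:
    """Write one processed line per input line, flushing after each one."""
    for line in lines:
        out.write(process(line.rstrip("\n")) + "\n")
        # Flush so the line leaves Python's own buffer right away
        out.flush()
```

In the script itself this would be called as emit_lines(sys.stdin, sys.stdout). Running the script with python -u is another way to get unbuffered stdout without changing the code.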