I have a job that reads data from a \n delimited stream and sends the information to xargs to process 1 line at a time. The problem is, this is not performant enough, but I know that if I altered the program such that the command executed by xargs was sent multiple lines instead of just one line at a time, it could drastically improve the performance of my script.
Is there a way to do this? I haven't been having any luck with various combinations of -L or -n. Unfortunately, I think I'm also stuck with -I to parameterize the input since my command doesn't seem to want to take stdin if I don't use -I.
The basic idea is that I'm trying to simulate mini-batch processing using xargs.
Conceptually, here's something similar to what I currently have written:
contiguous-stream | xargs -d '\n' -n 10 -L 10 -I {} bash -c 'process_line {}'
In the above, process_line is easy to change so that it could process many lines at once, and this function is currently the bottleneck. For emphasis: above, -n 10 and -L 10 don't seem to do anything; my lines are still processed one at a time.
Don't use -I here. It prevents more than one argument from being passed at a time, and it is outright dangerous (a major security bug) when used to substitute values into a string passed as code.
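To make the security concern concrete, here is a minimal sketch, assuming GNU xargs (for -d) and bash on the PATH. The input line and the variable names unsafe and safe are made up for illustration; the payload is a harmless echo, but it could be any command.

```shell
# A hypothetical input line that happens to contain shell syntax.
line='$(echo INJECTED)'

# Unsafe: -I pastes the line into the code string, so bash executes it.
unsafe=$(printf '%s\n' "$line" | xargs -d '\n' -I {} bash -c 'echo {}')

# Safe: the line arrives as the positional parameter $1 and is only ever data.
safe=$(printf '%s\n' "$line" | xargs -d '\n' -n 10 bash -c 'printf "%s\n" "$1"' _)

printf 'unsafe: %s\n' "$unsafe"
printf 'safe:   %s\n' "$safe"
```

In the first pipeline the command substitution inside the data runs; in the second the line is printed back unchanged.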
contiguous-stream | xargs -d $'\n' -n 10 \
bash -c 'for line in "$@"; do process_line "$line"; done' _
Here, we're passing the arguments added by xargs out-of-band from the code, in positions starting at $1 (the trailing _ fills $0), and then using "$@" to iterate over them.
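A quick way to check that batching is really happening is to have each shell report how many arguments it received. This is only a sketch, assuming GNU xargs; the seven-line input and the echo are stand-ins for the real stream and for process_line.

```shell
# Seven input lines with -n 3 should produce batches of 3, 3 and 1.
batches=$(printf '%s\n' 1 2 3 4 5 6 7 |
  xargs -d '\n' -n 3 bash -c 'echo "batch of $#: $*"' _)

printf '%s\n' "$batches"
```

If process_line were changed to accept many lines at once, the for loop could presumably be replaced by a single call that forwards "$@".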
Note that this reduces overhead inasmuch as it passes multiple arguments to each shell (so you pay shell startup costs fewer times), but it doesn't actually process all those arguments concurrently. For that, you want...
Assuming GNU xargs, you can use -P to specify a level of parallel processing:
contiguous-stream | xargs -d $'\n' -n 10 -P 8 \
bash -c 'for line in "$@"; do process_line "$line"; done' _
Here, we're passing 10 arguments to each shell, and running 8 shells at a time. Tune your arguments to taste: higher values of -n spend less time starting up new shells but increase the amount of waste at the end (if one process still has 8 to go and every other process is done, you're operating suboptimally).
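One thing to keep in mind with -P is that output from the parallel shells can arrive in any order. The sketch below, again assuming GNU xargs and using a placeholder echo instead of process_line, sorts the combined output so the result is stable.

```shell
# Eight lines, 2 per shell, up to 4 shells at once.
# Output order across shells is not guaranteed, so sort before comparing.
results=$(printf '%s\n' 1 2 3 4 5 6 7 8 |
  xargs -d '\n' -n 2 -P 4 \
    bash -c 'for line in "$@"; do echo "handled $line"; done' _ |
  sort)

printf '%s\n' "$results"
```

Every line is handled exactly once regardless of which shell picked it up; if the original order matters downstream, it has to be restored explicitly.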