Tags: bash, concurrency, batch-processing, xargs

Is there a way to force xargs to send multiple lines at once?


I have a job that reads data from a \n-delimited stream and pipes it to xargs, which processes one line at a time. This isn't performant enough, but I know that if the command executed by xargs were handed multiple lines at once instead of a single line, it could drastically improve the performance of my script.

Is there a way to do this? I haven't been having any luck with various combinations of -L or -n. Unfortunately, I think I'm also stuck with -I to parameterize the input since my command doesn't seem to want to take stdin if I don't use -I.

The basic idea is that I'm trying to simulate mini-batch processing using xargs.

Conceptually, here's something similar to what I currently have written:

contiguous-stream | xargs -d '\n' -n 10 -L 10 -I {} bash -c 'process_line {}'

In the above, process_line is easy to change so that it could process many lines at once, and that function is currently the bottleneck. For emphasis: -n 10 and -L 10 don't seem to do anything; my lines are still processed one at a time.


Solution

  • Multiple Lines Per Shell Invocation

    Don't use -I here. It prevents more than one argument from being passed at a time, and it is outright dangerous: substituting values into a string that is then parsed as code is a textbook shell-injection bug.
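    To illustrate that injection risk, here's a contrived demonstration (the "INJECTED" payload is hypothetical; a hostile input line could run anything):

    ```shell
    # A data line containing shell metacharacters is pasted directly into
    # the code string by -I, so the child shell executes it as code:
    printf '%s\n' 'hello; echo INJECTED' \
      | xargs -d '\n' -I {} bash -c 'echo processing: {}'
    # prints "processing: hello", then "INJECTED" -- the data ran as code
    ```

    The safe pattern below avoids this entirely by keeping the data out of the code string.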

    contiguous-stream | xargs -d $'\n' -n 10 \
      bash -c 'for line in "$@"; do process_line "$line"; done' _
    

    Here, we're passing the arguments added by xargs out-of-band from the code: they populate $1 and onward (the trailing _ fills $0), and "$@" iterates over them.

    Note that this reduces overhead inasmuch as it passes multiple arguments to each shell (so you pay shell startup costs fewer times), but it doesn't actually process all those arguments concurrently. For that, you want...

    Multiple Lines In Parallel

    Assuming GNU xargs, you can use -P to specify a level of parallel processing:

    contiguous-stream | xargs -d $'\n' -n 10 -P 8 \
      bash -c 'for line in "$@"; do process_line "$line"; done' _
    

    Here, we're passing 10 arguments to each shell, and running 8 shells at a time. Tune your arguments to taste: higher values of -n spend less time starting up new shells but increase the amount of waste at the end (if one shell still has most of its 10 lines to go while every other shell is done, you're operating suboptimally).
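    Since you mentioned process_line is easy to change to handle many lines at once, you can also hand each batch to a single function call instead of looping. A sketch (process_lines and its printf body are hypothetical stand-ins for your real batch handler):

    ```shell
    # Each bash invocation receives up to 2 lines (via -n 2) as "$@",
    # and passes the whole batch to one function call.
    printf '%s\n' line1 line2 line3 line4 line5 \
      | xargs -d '\n' -n 2 \
          bash -c 'process_lines() { printf "batch of %d lines\n" "$#"; }
                   process_lines "$@"' _
    # -> batch of 2 lines / batch of 2 lines / batch of 1 lines
    ```

    Adding -P 8 parallelizes the batches exactly as above; the output order may then vary between runs.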