I have to process ten very big files. Each file takes about two days to process with my_profiler. I can parallelize the work so that my_profiler runs on each file separately, using all of my system's cores. My approach to parallelizing the work is to run three processes in three different terminals at a time. I can't process more than four files at once, or my system becomes unresponsive (hangs).

My goal is to write a shell script which processes the ten files in batches of three. Once processing of one file finishes, its terminal should be closed and processing of a new file should start in another terminal. As the terminal I want to use gnome-terminal.
Currently I am stuck with the following script, which runs all processes in parallel:
for j in $jobs
do
gnome-terminal -- bash -c "my_profiler $j"
done
How can I wait until a shell script running in an instance of gnome-terminal finishes?

My first thought was that I might need to send a signal from the old terminals once their job is finished.
I am not quite sure why you have to start a new gnome-terminal for each job. But you could use xargs in combination with -P [1]. Running three my_profiler processes in parallel at the same time:

echo "${jobs}" | xargs -P3 -I{} gnome-terminal --wait -e 'bash -c "my_profiler {}"'
Important here is to start gnome-terminal with --wait; otherwise the terminal daemonizes itself, which has the effect that xargs immediately starts the next process. --wait was introduced in gnome-terminal 3.27.1.
The -I{} option to xargs defines a placeholder ({}) which xargs will replace with a filename before running the command [2]. In the example above, xargs scans the command string (gnome-terminal --wait -e 'bash -c "my_profiler {}"') for {} and replaces every instance with the first file coming from stdin (echo "${jobs}" | ...). It then executes the resulting string. xargs will do this three times (-P3) before it starts waiting for at least one process to finish. When one does, xargs starts the next process.
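The substitution mechanism can be observed with a harmless command in place of gnome-terminal (a sketch; printf stands in for the echo "${jobs}" feed, and the input names are made up):

```shell
# Each input line replaces {} once; with -P3 up to three copies run at a time.
# Output order is indeterminate because the three jobs run concurrently.
printf '%s\n' one two three | xargs -P3 -I{} sh -c 'echo "processing {}"'
```

Note that -I implies one command invocation per input line, which is exactly what you want here: one terminal per file.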
[1]: from man xargs

-P max-procs, --max-procs=max-procs
    Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option or the -L option with -P; otherwise chances are that only one exec will be done. While xargs is running, you can send its process a SIGUSR1 signal to increase the number of commands to run simultaneously, or a SIGUSR2 to decrease the number. You cannot increase it above an implementation-defined limit (which is shown with --show-limits). You cannot decrease it below 1. xargs never terminates its commands; when asked to decrease, it merely waits for more than one existing command to terminate before starting another.

    Please note that it is up to the called processes to properly manage parallel access to shared resources. For example, if more than one of them tries to print to stdout, the output will be produced in an indeterminate order (and very likely mixed up) unless the processes collaborate in some way to prevent this. Using some kind of locking scheme is one way to prevent such problems. In general, using a locking scheme will help ensure correct output but reduce performance. If you don't want to tolerate the performance difference, simply arrange for each process to produce a separate output file (or otherwise use separate resources).
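That last caveat applies directly here: several my_profiler processes writing to the same stream would interleave their output. Redirecting each job into its own log file sidesteps the problem (a sketch; echo stands in for my_profiler, and the {}.log naming is an assumption):

```shell
# One log file per job, so concurrently running jobs never share stdout.
# logdir is a throwaway directory; a real run would pick a fixed location.
logdir=$(mktemp -d)
printf '%s\n' a b c d | xargs -P3 -I{} sh -c "echo 'result for {}' > $logdir/{}.log"
```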
[2]: from man xargs

-I replace-str
    Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character. Implies -x and -L 1.
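If the extra terminals turn out to be unnecessary after all, the same batching can be done with bash job control alone (a sketch, assuming bash 4.3+ for wait -n; sleep and echo stand in for my_profiler, and the done_log file only records finished jobs for illustration):

```shell
#!/usr/bin/env bash
# Keep at most three background jobs running; start the next job as soon
# as any one of them finishes, so the batches refill continuously.
files="f1 f2 f3 f4 f5 f6 f7 f8 f9 f10"
max=3
done_log=$(mktemp)
for j in $files; do
    # If $max jobs are already running, block until one of them exits.
    while (( $(jobs -rp | wc -l) >= max )); do
        wait -n              # bash 4.3+: returns when any single job exits
    done
    { sleep 0.1; echo "$j" >> "$done_log"; } &   # stand-in for: my_profiler "$j"
done
wait                          # drain the final batch
```

This avoids the gnome-terminal --wait requirement entirely, at the cost of losing the per-job terminal windows.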