Tags: bash, image-processing, parallel-processing, scientific-computing

Parallelization of image processing programs on large image set


I currently have a very large directory containing over 9,000 folders, each holding JPEG images (an average of 40 per folder).

My program takes an input folder of images and writes the feature vectors of the images in that folder to text files:

./process_image images/ output/

I also have a script with the following usage:

./script.sh dirlist.txt images/ output/ 1

The first argument, dirlist.txt, contains the folder names inside the input directory. The second and third arguments are the base directories for inputs and outputs. The fourth argument is the index of the entry in dirlist.txt that I want to process.

Assuming that imageset1 is at index 1 in dirlist.txt, the above example would call:

./process_image images/imageset1/ output/imageset1/
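
In essence, script.sh looks up the folder name at the given index and calls process_image on it. A simplified, illustrative sketch (not the exact script):

#!/bin/bash
# script.sh (simplified sketch, for illustration only)
# Usage: ./script.sh dirlist.txt images/ output/ <index>
dirlist=$1    # text file with one folder name per line
inbase=$2     # base input directory, e.g. images/
outbase=$3    # base output directory, e.g. output/
index=$4      # 1-based line number in dirlist to process

folder=$(sed -n "${index}p" "$dirlist")    # folder name at that index
mkdir -p "$outbase/$folder"
./process_image "$inbase/$folder/" "$outbase/$folder/"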

If I were to do this sequentially, it would take me days to process all 9,000 folders. What is the best method for parallelization in this case? Should I write a script that splits the 9,000 folders into blocks and runs the script separately on each block, each covering a certain range of indexes? Also, how do I determine how many instances I can run at once, given that one executable can use anywhere from 100 MB to 1 GB of RAM? I have 32 GB of RAM.


Solution

  • I regularly process 65,000+ images per day and I just about always use GNU Parallel - see here and here. I wouldn't bother parallelising C code!

    It allows you to specify how many jobs to run in parallel, or just use the default of one job per CPU core. It is pretty simple to use: all you would do is change your script.sh so that, rather than starting the jobs itself, it echoes all the commands it would have started, one per line, to stdout; you then pipe that into parallel, like this:

    script.sh | parallel
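
    A sketch of such a modified script.sh, assuming the dirlist.txt / images/ / output/ layout from the question (the index argument disappears, since every folder gets listed):

    #!/bin/bash
    # script.sh - print one process_image command per folder (illustrative sketch)
    dirlist=$1     # e.g. dirlist.txt
    inbase=$2      # e.g. images/
    outbase=$3     # e.g. output/
    while IFS= read -r folder; do
       mkdir -p "$outbase/$folder"                                   # create the output folder up front
       echo "./process_image '$inbase/$folder/' '$outbase/$folder/'" # emit the command instead of running it
    done < "$dirlist"

    which you would then run as:

    ./script.sh dirlist.txt images/ output/ | parallel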
    

    You can add flags like -j 8 to run 8 jobs in parallel, or -k to keep the output order if that is relevant.

    script.sh | parallel -j 8 -k
    

    Likewise, if you are concerned about memory usage, you can tell parallel to only start new jobs when the system has at least 1 GB of free memory:

    script.sh | parallel --memfree 1G
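
    With the numbers from the question (32 GB of RAM, up to roughly 1 GB per job), you could combine the two: cap the job count a little below the theoretical maximum of ~32 and keep --memfree as a safety net, e.g.

    script.sh | parallel -j 24 --memfree 1G

    The exact count is a judgment call - leaving a few gigabytes of headroom for the OS and filesystem cache is usually sensible.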
    

    You can also add a list of other machines and it will distribute the jobs across them for you :-)
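
    For example, with hypothetical host names (this assumes the remote machines can also see the images/ and output/ directories, e.g. over NFS):

    script.sh | parallel -S :,user@server1,user@server2

    where : means the local machine and the other entries are ssh logins.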

    Here is a tiny example:

    #!/bin/bash
    # script.sh
    
    for i in {0..99}; do
       echo "echo Start job $i; sleep 5; echo End job $i"
    done
    

    Then

    script.sh | parallel
    

    and the 500 seconds of work gets done in 70 seconds on my 8-core machine (100 jobs in 8 slots is 13 waves of 5 seconds, plus a little startup overhead), or 21 seconds if I use parallel -j 25 (4 waves of 5 seconds).