
Running GNU Parallel Java Job

Is this the correct way to execute a Java job with the input myFile.txt? I want to run the MyJavaClass program with the input passed in args[0], but I want to run it locally on my machine across multiple cores rather than on a cluster.

parallel java MyJavaClass ::: myFile.txt

EDIT:

What I want to accomplish is the following:

java MyJavaClass arg1 arg2 arg3 
java MyJavaClass arg4 arg5 arg6
java MyJavaClass arg7 arg8 arg9

and I would like these jobs to run in parallel.

Solution

If myFile.txt has millions of lines and you want it split into one chunk per CPU core, with MyJavaClass run on each chunk, and we assume that MyJavaClass reads from stdin (standard input) and prints to stdout (standard output), then the 3 commands would look something like this:

cat chunk1 | java MyJavaClass > output1
cat chunk2 | java MyJavaClass > output2
cat chunk3 | java MyJavaClass > output3

then it looks like this using GNU Parallel (--block -1 splits the file into one chunk per job slot, which by default means one per CPU core):

parallel -a myFile.txt --pipepart --block -1 java MyJavaClass > combined_output
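For the stdin/stdout case above, the Java side could be as simple as the following sketch. The class name StdinFilterSketch and the upper-casing in processLine are placeholders I made up, since the question does not say what MyJavaClass actually does. The only real assumption is that each line can be processed on its own, which is what makes it safe to hand each process a different chunk of the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for MyJavaClass: a line-by-line stdin-to-stdout filter.
class StdinFilterSketch {

    // Placeholder for the real per-line work (assumption: lines are independent).
    static String processLine(String line) {
        return line.toUpperCase(Locale.ROOT);
    }

    // Streams lines from in to out, one at a time, and returns how many were handled.
    static long run(Reader in, PrintStream out) {
        long count = 0;
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                out.println(processLine(line));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) {
        // GNU Parallel connects each chunk to this process's standard input.
        run(new InputStreamReader(System.in, StandardCharsets.UTF_8), System.out);
    }
}
```

Because it reads one line at a time, memory use stays flat no matter how large the chunk is.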

If MyJavaClass instead takes a filename, so the 3 commands look like this:

java MyJavaClass chunk1 > output1
java MyJavaClass chunk2 > output2
java MyJavaClass chunk3 > output3

then this may work:

# --fifo is fast, but may not work if MyJavaClass seeks into the file
parallel -a myFile.txt --pipepart --fifo --block -1 java MyJavaClass {} > combined_output
# --cat creates temporary files
parallel -a myFile.txt --pipepart --cat --block -1 java MyJavaClass {} > combined_output
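Whether --fifo is usable depends on how the program reads the file named by {}. As a rough sketch (FileArgSketch and the line counting are hypothetical placeholders, not the asker's actual code), a program that opens the file once and reads it from start to end never seeks, so it should be fine with --fifo. If your program reopens the file or jumps around in it, use --cat instead.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for MyJavaClass that takes the input file name in args[0].
class FileArgSketch {

    // Reads the stream exactly once, front to back; no seeking or reopening.
    static long countLines(InputStream in) {
        long lines = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FileArgSketch <input-file>");
            System.exit(2);
        }
        // args[0] is whatever GNU Parallel substitutes for {}.
        System.out.println(countLines(new FileInputStream(args[0])));
    }
}
```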

If MyJavaClass writes its output to a named file, so the 3 commands look like this:

java MyJavaClass chunk1 --output-file chunk1.output
java MyJavaClass chunk2 --output-file chunk2.output
java MyJavaClass chunk3 --output-file chunk3.output

then you can use the fact that {#} is the job number, and therefore unique, to give each job its own output file:

parallel [...] java MyJavaClass {} --output-file {#}.output
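On the Java side, picking up the output name could look roughly like this. It is only a sketch: OutputFileSketch is a made-up name, the argument layout (input file first, then --output-file and its value) is assumed from the commands above, and writing each line's length is a placeholder for the real work.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for MyJavaClass with an --output-file option.
class OutputFileSketch {

    // Returns the value that follows --output-file, or null if the flag is absent.
    static String outputFileFrom(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if ("--output-file".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String output = outputFileFrom(args);
        if (args.length < 3 || output == null) {
            System.err.println("usage: java OutputFileSketch <input-file> --output-file <file>");
            System.exit(2);
        }
        // Assumption: args[0] is the input file that GNU Parallel substitutes for {}.
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]));
             PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(output)))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line.length()); // placeholder for the real per-line result
            }
        }
    }
}
```

Since every job gets a different {#}, no two processes write to the same file.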