Tags: php, linux, bash, shell, gnu-parallel

Nesting GNU Parallel to process multiple huge files and split each file's data to be processed as a queue


I have a directory with almost 100 log files, each weighing 10–15 GB. The requirement is to read each file line by line (order doesn't matter at all), clean up the line's JSON, and dump it to the backend Elasticsearch storage for indexing.

Here is my worker that does this job:

<?php
// file = worker.php

echo " -- New PHP Worker Started -- \n"; // to count how many times gnu-parallel started the worker

// NOTE: $elasticsearch is assumed to be initialized elsewhere (client setup omitted)
$dataSet = [];

while (false !== ($line = fgets(STDIN))) {

    // convert line text to json
    $l = json_decode($line);
    $dataSet[] = $l;

    if (sizeof($dataSet) >= 1000) {
        // index json to elasticsearch
        $elasticsearch->bulkIndex($dataSet);
        $dataSet = [];
    }
}

// flush any leftover lines that did not fill a whole batch of 1000
if (sizeof($dataSet) > 0) {
    $elasticsearch->bulkIndex($dataSet);
}
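
A quick way to sanity-check the worker on its own, before involving parallel at all, is to feed it a small slice of one of the real log files (the file name below is just a placeholder):

head -n 2000 /data/directory/app.log | php worker.php

With the worker above, that should result in two bulk requests of 1000 documents each.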

With the help of answers here and here, I am almost there, and it is working (kind of), but I just need to make sure that under the hood it is actually doing what I assume it is doing.

With just one file, I can handle it as below:

parallel --pipepart -a 10GB_input_file.txt  --round-robin php worker.php 

That works great. Adding --round-robin makes sure that each php worker process is started only once and then just keeps receiving data as a pipeline (a poor man's queue).

So on a 4-CPU machine, it fires up 4 php workers, and they crunch through all the data very quickly.
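
The " -- New PHP Worker Started -- " banner printed by the worker makes this easy to verify. A quick check along these lines (assuming the banner goes to stdout, as in the worker above) should print 4 on a 4-CPU machine:

parallel --pipepart -a 10GB_input_file.txt --round-robin php worker.php | grep -c 'New PHP Worker Started'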

To do the same for all files, here is my take on it:

find /data/directory -maxdepth 1 -type f | parallel cat | parallel --pipe -N10000 --round-robin php worker.php 

It kind of looks like it's working, but I have a gut feeling that this is the wrong way of nesting parallel for all the files.

Secondly, since it cannot use --pipepart, I think it is slower.

Thirdly, once the job is complete, I see that on a 4-CPU machine only 4 workers were started and the job got done. Is that the right behavior? Shouldn't it start 4 workers for every file? I just want to make sure I didn't miss any data.
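
For some reassurance, I can at least add GNU Parallel's --joblog option, which records one line per job with its runtime and exit status, and confirm that only 4 workers ran and that they all exited cleanly (the log path here is arbitrary):

find /data/directory -maxdepth 1 -type f | parallel cat | parallel --pipe -N10000 --round-robin --joblog /tmp/workers.log php worker.php
cat /tmp/workers.log   # header line plus one line per worker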

Any idea how this could be done in a better way?


Solution

  • If the files are roughly the same size, why not simply give a single file to each worker:

    find /data/directory -maxdepth 1 -type f |
      parallel php worker.php '<' {}
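
    By default parallel runs as many of these jobs at once as there are CPU cores, so on a 4-CPU machine 4 files are processed concurrently. If you want a different level of concurrency, it can be set explicitly with -j (a minor variation, shown here with 4 job slots):

    find /data/directory -maxdepth 1 -type f |
      parallel -j4 php worker.php '<' {}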
    

    Another way is to use --pipepart on each of them:

    do_one() {
      # --block -1 splits the file into one chunk per jobslot (one per CPU core)
      parallel --pipepart -a "$1" --block -1 php worker.php
    }
    export -f do_one
    # -j1 processes one file at a time; the inner parallel already uses all cores
    find /data/directory -maxdepth 1 -type f | parallel -j1 do_one
    

    If it does not take long to start php worker.php, then the last approach may be preferable, because it distributes the work more evenly when the files are of very different sizes: if the last file is huge, you do not end up waiting for a single process to finish processing it.
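
    A small side note, not something the approach above depends on: if any of the file names could contain spaces or newlines, the same pipeline also works with null-delimited names:

    find /data/directory -maxdepth 1 -type f -print0 |
      parallel -0 -j1 do_one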