Tags: bash, perl, unix, split, cat

How to split files up and process them in parallel and then stitch them back? (unix)


I have a text file infile.txt as such:

abc what's the foo bar.
foobar hello world, hhaha cluster spatio something something.
xyz trying to do this in parallel
kmeans you're mean, who's mean?

Each line in the file will be processed by this Perl command and written to out.txt:

`cat infile.txt | perl dosomething > out.txt`
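(`dosomething` itself isn't shown; for illustration, assume it is any line-oriented filter that reads stdin and writes stdout, e.g. a hypothetical one-liner that uppercases each line: `cat infile.txt | perl -ne 'print uc' > out.txt`.)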

Imagine the text file has 100,000,000 lines. I want to parallelize the bash command, so I tried something like this:

$ mkdir splitfiles
$ mkdir splitfiles_processed
$ cd splitfiles
$ split -n3 ../infile.txt
$ for i in *; do cat "$i" | perl dosomething > "../splitfiles_processed/$i" & done
$ wait
$ cd ../splitfiles_processed
$ cat * > ../infile_processed.txt
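
(Side note: `split -n 3` chunks the input by bytes, so a line can be cut in half at a chunk boundary; with GNU split, splitting on line boundaries instead would look like `split -n l/3 ../infile.txt`.)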

But is there a less verbose way to do the same?


Solution

  • The answer from @Ulfalizer gives you a good hint about the solution, but it lacks some details.

    You can use GNU parallel (`apt-get install parallel` on Debian).

    So your problem can be solved using the following command:

    cat infile.txt | parallel -l 1000 -j 10 -k --spreadstdin perl dosomething > result.txt
    

    Here is the meaning of the arguments:

    -l 1000: send blocks of 1000 lines to each command
    -j 10: run 10 jobs in parallel
    -k: keep the output in the same order as the input
    --spreadstdin: send each of those 1000-line blocks to the stdin of a job
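
    Newer GNU parallel documentation treats --spreadstdin as another name for --pipe, and the block size can be given as a record (line) count with -N. Assuming a version that supports those spellings, a roughly equivalent command, reading the file directly instead of through cat, would be:

    parallel --pipe -N 1000 -j 10 -k perl dosomething < infile.txt > result.txt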