Search code examples
bashgnu-parallel

GNU Parallel: How to start job number replacement string at zero


I'm very pleased with the speed of using GNU parallel with splitting multi-GB CSV database export files into manageable chunks. However, the problem I'm having is that I'd like my output file names to be in the format some_table.csv.part_0000.csv and start at zero (the import tool requires this). Getting "0001" was a challenge, but I managed to use printf to achieve this. I can't get the decrement to work though.

My Command:

FILE=some_table; parallel -v --joblog split.log --pipepart --recend '-- EOL\n' --block 25M "cat > $FILE.csv.part_$(printf "%04d"{#}).csv" :::: $FILE.csv

Doing things like expression expansion ($FILE.csv.part_$(({#}-1)).csv) don't work because {#} confuses the inner subshell. So does PART=$(({#}-1)); cat > $FILE.csv.part_$PART.csv.

Any suggestions?


Solution

  • Use the {= =} contruct:

    FILE=some_table;  parallel -v --joblog split.log --pipepart --recend '-- EOL\n' --block 25M "cat > $FILE.csv.part_"'{=$_=sprintf("%04d",$job->seq()-1)=}'".csv" :::: $FILE.csv
    

    If you are going to use it a lot then define your own replacement string by putting this into ~/.parallel/config:

    --rpl '{0000#} $_=sprintf("%04d",$job->seq()-1)'
    

    Then use {0000#}:

    seq 11 | parallel echo {0000#}
    

    If you just want the numbers to be fixed width (and not necessarily 4 digits):

    --rpl '{0#} $f="%0".int(1+log(total_jobs()-1)/log(10))."d";$_=sprintf($f,$job->seq()-1)'
    

    Then use {0#}:

    seq 11 | parallel echo {0#}
    

    On a different note: Why save it to files at all? Why not pass it directly to the database importer and use --retries/--retry-failed to retry failed chunks?

    If you want it for jobslot:

    parallel --rpl '{0000%} $_=sprintf("%04d",$job->slot())' echo {0000%} ::: {1..100}
    

    You can also use a dynamic replacement string:

    --rpl '{(0+?)%} $l=length $$1; $_=sprintf("%0${l}d",$job->slot())'
    --rpl '{(0+?)#} $l=length $$1; $_=sprintf("%0${l}d",$job->seq())'
    
    parallel echo {0%} ::: {1..100}
    parallel echo {0#} ::: {1..100}
    parallel echo {00%} ::: {1..100}
    parallel echo {00#} ::: {1..100}
    parallel echo {000%} ::: {1..100}
    parallel echo {000#} ::: {1..100}
    

    Since version 20210222 you can do:

    parallel --plus echo {0%} ::: {1..100}
    parallel --plus echo {0#} ::: {1..100}
    

    which will automatically detect the needed leading zeros.