I'm very pleased with the speed of GNU parallel when splitting multi-GB CSV database export files into manageable chunks. However, I'd like my output file names to be in the format some_table.csv.part_0000.csv and to start at zero (the import tool requires this). Getting "0001" was a challenge, but I managed it with printf. I can't get the decrement to work, though.
My Command:
FILE=some_table; parallel -v --joblog split.log --pipepart --recend '-- EOL\n' --block 25M "cat > $FILE.csv.part_$(printf "%04d"{#}).csv" :::: $FILE.csv
Arithmetic expansion such as $FILE.csv.part_$(({#}-1)).csv doesn't work because {#} confuses the inner subshell. Neither does PART=$(({#}-1)); cat > $FILE.csv.part_$PART.csv.
Any suggestions?
Use the {= =} construct:
FILE=some_table; parallel -v --joblog split.log --pipepart --recend '-- EOL\n' --block 25M "cat > $FILE.csv.part_"'{=$_=sprintf("%04d",$job->seq()-1)=}'".csv" :::: $FILE.csv
If you are going to use it a lot then define your own replacement string by putting this into ~/.parallel/config:
--rpl '{0000#} $_=sprintf("%04d",$job->seq()-1)'
Then use {0000#}:
seq 11 | parallel echo {0000#}
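To check which file names the expression sprintf("%04d",$job->seq()-1) should produce, the same arithmetic can be reproduced in plain shell. This is only a sketch that emulates the numbering outside GNU parallel: part_name is a hypothetical helper, and its second argument stands in for the job sequence number {#}, which starts at 1.

```shell
#!/bin/sh
# Hypothetical helper: emulate $_=sprintf("%04d",$job->seq()-1) for one job.
# $1 = table name, $2 = 1-based job sequence number (what {#} would be).
part_name() {
    printf '%s.csv.part_%04d.csv\n' "$1" "$(($2 - 1))"
}

# The first three jobs should map to parts 0000, 0001 and 0002.
for seq in 1 2 3; do
    part_name some_table "$seq"
done
```

The subtraction happens before the zero padding, so job 1 becomes part 0000, which is the zero-based naming the import tool asks for.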
If you just want the numbers to be fixed width (and not necessarily 4 digits):
--rpl '{0#} $f="%0".int(1+log(total_jobs()-1)/log(10))."d";$_=sprintf($f,$job->seq()-1)'
Then use {0#}:
seq 11 | parallel echo {0#}
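The idea behind the fixed-width variant is that the width equals the number of digits in the largest zero-based index, total_jobs()-1. A rough shell sketch of that idea follows; note that it counts the digits of the number as a string instead of using logarithms, and padded_index is a hypothetical helper, not part of GNU parallel.

```shell
#!/bin/sh
# Hypothetical helper: print the 0-based index of a job, zero-padded to the
# width of the largest index (total - 1).
# $1 = 1-based job sequence number, $2 = total number of jobs.
padded_index() {
    last=$(($2 - 1))      # largest 0-based index
    width=${#last}        # number of decimal digits in that index
    printf "%0${width}d\n" "$(($1 - 1))"
}

# With 11 jobs the indexes run from 00 to 10.
for seq in 1 2 11; do
    padded_index "$seq" 11
done
```

With 11 jobs the largest index is 10, so every index is printed with two digits.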
On a different note: why save it to files at all? Why not pass it directly to the database importer and use --retries/--retry-failed to retry failed chunks?
If you want the same padding for the job slot number:
parallel --rpl '{0000%} $_=sprintf("%04d",$job->slot())' echo {0000%} ::: {1..100}
You can also use a dynamic replacement string:
--rpl '{(0+?)%} $l=length $$1; $_=sprintf("%0${l}d",$job->slot())'
--rpl '{(0+?)#} $l=length $$1; $_=sprintf("%0${l}d",$job->seq())'
parallel echo {0%} ::: {1..100}
parallel echo {0#} ::: {1..100}
parallel echo {00%} ::: {1..100}
parallel echo {00#} ::: {1..100}
parallel echo {000%} ::: {1..100}
parallel echo {000#} ::: {1..100}
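In the dynamic definitions, the run of zeros captured by (0+?) is available as $$1, and its length becomes the field width. A small shell sketch of that rule, assuming a hypothetical helper pad_like that takes the zeros and the number as arguments:

```shell
#!/bin/sh
# Hypothetical helper: emulate $l=length $$1; $_=sprintf("%0${l}d", N).
# $1 = run of zeros from the replacement string, e.g. "000" from {000#}
# $2 = number to format (job slot or job sequence number)
pad_like() {
    width=${#1}           # one digit of width per zero
    printf "%0${width}d\n" "$2"
}

pad_like 0 7      # width 1: no visible padding
pad_like 00 7     # width 2
pad_like 000 42   # width 3
```

A number with more digits than the requested width is printed in full, since the width in a printf format is a minimum and never truncates.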
Since version 20210222 you can do:
parallel --plus echo {0%} ::: {1..100}
parallel --plus echo {0#} ::: {1..100}
which will automatically detect the needed leading zeros.