parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file
input is a space-delimited file where the first column is the pattern and the second column is the replacement.
The problem is that after I ran the command, not all of the patterns were replaced in file. Then I ran the same command again; more patterns were replaced, but still not all of them.
However, if I change --jobs 100 to --jobs 1, it works as expected (but much more slowly).
Is there a necessary parameter missing from my command?
Let us assume that input is big and file is huge, so you really do not want to read file more than once. (Concurrent sed -i processes are also why replacements go missing: each one rewrites file based on the version it happened to read, so parallel writers overwrite each other's changes.)
First you need to convert input into a single big sed script:
cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
As @tripleee says, you may need to sort this so that the longest source strings come first (otherwise a shorter pattern could rewrite part of a longer one before the longer one matches).
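A sketch of that sort step, on a made-up two-line input where the source ab overlaps the shorter source a (awk/sort/cut are used here; the length of column 1 serves as the sort key):

```shell
# Hypothetical input: 'a' is a prefix of 'ab', so 'ab' must come first
printf 'a X\nab Y\n' > input
# Prefix each line with the length of its first column, sort longest-first,
# then strip the length prefix again
awk '{ print length($1), $0 }' input | sort -rn -k1,1 | cut -d' ' -f2- > input.sorted
cat input.sorted
```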
Then you need to split file into one chunk per CPU thread, run the script on each chunk, and finally append the replaced chunks back together in order:
parallel --pipepart -a file -k sed -f bigsed > replaced
You will need /tmp to have enough free space to hold replaced, or set $TMPDIR to a directory that does.
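Putting the steps together on a toy example (the pattern/replacement pairs and file contents below are made up, and awk plus a single sed stand in for the parallel invocations so the sketch runs even without GNU Parallel installed):

```shell
printf 'foobar baz\nfoo bar\n' > input   # hypothetical pairs, already longest-first
printf 'a foo b foobar c\n' > file       # hypothetical text to rewrite
# Build the big sed script (same result as the parallel echo step above)
awk '{ print "s/" $1 "/" $2 "/g" }' input > bigsed
# A single pass over the file applies every substitution
sed -f bigsed file > replaced
cat replaced
```

On the real, huge file you would replace the last sed with the --pipepart invocation so each CPU thread handles one chunk.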