sed gnu-parallel

What's the correct usage of sed with parallel --jobs option?


parallel -a input --colsep ' ' --jobs 100 -I {} sed -i 's/{1}/{2}/g' file

input is a file delimited by space, where the first column is pattern and the second column is replacement.

The problem is that after I ran the command, not all patterns were replaced in file. When I ran the same command again, more patterns were replaced, but still not all. However, if I change --jobs 100 to --jobs 1, it works as expected (but much more slowly).

Is a necessary parameter missing from my command?


Solution

The parallel jobs race with each other: every sed -i writes its own temporary copy of file and then renames it over the original, so when many jobs run at once, a job that finishes later discards the edits made by the jobs that finished before it. With --jobs 1 the edits happen one after another, which is why that works.

  • Let us assume that input is big and file is huge.

    You really do not want to read file more than once.

    First you need to convert input into a single big sed script.

    cat input | parallel --colsep ' ' echo s/{1}/{2}/g >bigsed
    

    As @tripleee says, you may need to sort this, so the longest source string is first.
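
    A minimal sketch of that sorting step. It builds the script with awk and sort rather than parallel (my substitution, only to keep the example short), and the three sample mappings are hypothetical. It assumes that patterns and replacements contain no spaces, slashes, or other characters that would need escaping in a sed command.

    ```shell
    # Work in a scratch directory so no real files are overwritten.
    cd "$(mktemp -d)" || exit 1

    # Hypothetical sample mapping: pattern, space, replacement.
    printf '%s\n' 'foo bar' 'foobar baz' 'fo qux' > input

    # Prefix each sed command with the length of its pattern,
    # sort numerically with the longest pattern first,
    # then cut the length column off again.
    awk '{ print length($1), "s/" $1 "/" $2 "/g" }' input |
      sort -k1,1nr |
      cut -d' ' -f2- > bigsed

    cat bigsed
    ```

    With this sample, the command for foobar ends up ahead of the ones for foo and fo, so the shorter patterns cannot rewrite part of a longer match first.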

    Then you need to split file into one chunk per CPU thread, run the script on each chunk and finally append the replaced chunks back in order:

    parallel --pipepart -a file -k sed -f bigsed > replaced
    

    /tmp must have enough free space to hold replaced; if it does not, set $TMPDIR to a directory that does.
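
    Put together, the two steps can be tried on a toy example like the one below. The sample contents of input and file are hypothetical, awk stands in for the parallel call that generates bigsed, and the fallback to a single sed pass is only there so the sketch still runs where GNU Parallel is not installed.

    ```shell
    # Work in a scratch directory so no real files are overwritten.
    cd "$(mktemp -d)" || exit 1

    # Hypothetical sample data standing in for the real input and file.
    printf '%s\n' 'apple pear' 'cat dog' > input
    printf '%s\n' 'apple cat' 'cat apple' 'no match here' > file

    # Step 1: one sed command per line of input.
    awk '{ print "s/" $1 "/" $2 "/g" }' input > bigsed

    # Step 2: split file into chunks at line boundaries, run the whole
    # script on each chunk, and keep the chunks in their original order (-k).
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
      parallel --pipepart -a file -k sed -f bigsed > replaced
    else
      # Fallback: a single sequential sed pass gives the same result.
      sed -f bigsed file > replaced
    fi

    cat replaced
    ```

    Because file is only read and the result goes to a separate file, no two jobs ever write to the same place, which is what avoids the lost updates seen with sed -i.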