bash, parallel-processing, gnu-parallel

Using Parallel with a sed command iterating over thousands of files


I have hundreds of thousands of files that I want to run the following sed command on:

sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G'

So far, I have been using a bash loop:

for i in read_* ; do
    sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G' "$i"
    mv "$i" "$i.fasta"
done

How can I use GNU Parallel to speed this up?

ls read_* > list.read.txt
parallel -j $cores -a list.read.txt sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G' []

I tried the above approach, creating a list of files to iterate over and running 10 jobs at once; however, I get sed-related errors.


Solution

  • Try:

    parallel -q -v -j "$cores" -a list.read.txt sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G'
    
    • The -q option is necessary to quote special characters (spaces, >, ...) in the command arguments.
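    • The [] was causing the code to break when I tested it, so I removed it. I don't know what it was supposed to do. GNU Parallel's replacement string is {}, and when the command contains no replacement string, each input line is appended to the end of the command.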
    • I added quotes to "$cores" because variable expansions should almost always be quoted. See "When to wrap quotes around a shell variable?". Use ShellCheck to find missing quotes and many other shell code errors.
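
Note that the command above only runs sed; it does not do the mv to .fasta from the original loop. One possible way to keep the rename is to wrap both steps in an exported bash function and let parallel call it. This is only a sketch: it assumes GNU sed, GNU Parallel starting bash for its jobs, and file names beginning with read_ in a single directory. The names fix_read and fix_all_reads are made up for illustration, and the default of 10 jobs is taken from the question.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run the sed script from the question on one file,
# then rename the file to <name>.fasta only if sed succeeded.
fix_read() {
    sed -s -i -e 's/[[:space:]].*//' -e '1 s/^/>/g' -e '3 s/|*//g' \
        -e '3 s/^/>ref/g' -e '1h;2H;1,2d;4G' -- "$1" &&
        mv -- "$1" "$1.fasta"
}
# Export the function so the bash shells started by parallel can see it.
export -f fix_read

# Hypothetical driver: process every read_* file in a directory.
#   $1 = directory (default: current directory)
#   $2 = number of simultaneous jobs (default: 10)
fix_all_reads() {
    local dir=${1:-.} jobs=${2:-10}
    # find streams NUL-terminated names, so there is no huge argument list
    # and no intermediate list file; already renamed files are skipped.
    find "$dir" -maxdepth 1 -type f -name 'read_*' ! -name '*.fasta' -print0 |
        parallel -0 -j "$jobs" fix_read {}
}

# Usage (assumes $cores is set as in the question):
#   fix_all_reads . "$cores"
```

Because the sed script lives inside the function, nothing special has to survive parallel's own command-line handling, so -q is not needed here. Whether this is faster than the plain loop will depend on the disk, since each job is short and mostly I/O.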