Tags: bash, parallel-processing, string-comparison, large-files

GNU parallel with custom script doing string comparison


The following script, find_something.sh, compares part of a string (read from stdin by piping a CSV file in with cat) to a reference string and reports the differences in a specific format:

#!/usr/bin/env bash

reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
while read line; do
  line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
  output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
  echo "$(echo ${line:0:35}, $output)"
done < "${1:-/dev/stdin}"

It is intended to be executed on a number of rows from a very large file in the format

XYZ,ABMDEFG

and it works well when I use it in a pipe:

cat large_file | ./find_something.sh

However, when I try to use it with parallel, I get this error:

$  cat large_file | parallel ./find_something.sh
./find_something.sh: line 9: XYZ, ABMDEFG : No such file or directory

What is causing this? Is parallel supposed to work for something like this, if I want to redirect the output to a single file afterwards?

Less important side note: I'm rather proud of my string comparison method, but if someone has a faster way to get from ABCDEFG and XYZ,ABMDEFG to XYZ,C3M, I'd be happy to hear that, too.

Edit:

I should have said, I also want to preserve the order of each line in the output, corresponding to the input. Is that possible using parallel?


Solution

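  • Your script reads its input from the file named by its first argument (defaulting to stdin), whereas parallel, by default, passes each input line to the command as an argument, not via stdin. In that sense, parallel is closer to xargs. The script therefore receives the line itself as $1 and tries to open it as a file, which is what produces the "No such file or directory" error.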

    Presumably, you want each of the lines in large_file to be processed as a unit, possibly in parallel.

    That means you need your script to only process one such line at a time, and let parallel call your script many times, once for each line.

    So your script should look like this:

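    #!/usr/bin/env bash
    
    reference="ABCDEFG"
    # Put each character of the reference on its own line
    ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
    # parallel passes one input line per invocation as the first argument
    line="$1"
    # Take the second CSV field and split it the same way
    line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
    # Pair the characters up, keep the numbered mismatches, and format them as e.g. C3M
    output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
    echo "$(echo ${line:0:35}, $output)"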
    

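    Then you can redirect to a file as follows. The -k (--keep-order) option makes parallel print the results in the same order as the input lines, even if the jobs finish out of order:

    cat large_file | parallel -k ./find_something.sh > output_file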
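
On the side question about a faster comparison: most of the time in the original script probably goes into starting several external processes (sed, cut, paste, grep) for every line. Below is a minimal sketch that does the comparison with bash string indexing only. The function names diff_against_ref and process_stream, the variable diff_out, and the output format id,C3M (taken from the example in the question, with multiple differences separated by spaces) are assumptions for illustration. Unlike the original pipeline, it does not restrict characters to A-Z or *, and it ignores positions where the input is shorter than the reference.

```bash
#!/usr/bin/env bash
# Sketch: report positions where CSV field 2 differs from a reference string.
# All names below are hypothetical and not part of the original post.

reference="ABCDEFG"

# Compare $2 against $1 character by character.
# Stores the result in the global variable diff_out (avoids a subshell per line).
diff_against_ref() {
  local ref=$1 seq=$2 out="" i r s
  for (( i = 0; i < ${#ref}; i++ )); do
    r=${ref:i:1}
    s=${seq:i:1}
    # Record a difference only if the observed character exists and differs
    if [[ -n $s && $r != "$s" ]]; then
      out+="${out:+ }${r}$((i + 1))${s}"   # e.g. C3M: reference char, 1-based position, observed char
    fi
  done
  diff_out=$out
}

# Read "id,sequence" lines from stdin and print "id,differences" for each.
process_stream() {
  local id seq rest
  while IFS=, read -r id seq rest; do
    diff_against_ref "$reference" "$seq"
    printf '%s,%s\n' "$id" "$diff_out"
  done
}

# To use the file as a filter, call process_stream as its last line,
# for example: process_stream < large_file
```

If process_stream is called at the end of a script file (say, a hypothetical find_fast.sh), that script reads stdin like the original one did, so it should also work with parallel's --pipe mode, which splits stdin into chunks and feeds each chunk to one job: `parallel --pipe -k ./find_fast.sh < large_file > output_file`. Whether this is actually faster than a single process depends on the file size and the machine, so it is worth timing both.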