
Parallelize an awk script with multiple input files and change the name of the output file


I have a series of text files in a folder sub.yr_by_yr, which I pass to a for loop. Each text file lists header values, and my subbeagle.awk script uses them to subset the matching columns from a Beagle file. I want to parallelize this. The output file name is derived from the name of the text file: bash suffix removal (file11=${file1%.subbeagle.txt}) strips the suffix, which gives the desired output name (MM.beagle.${file11}.gz).

for file1 in $(ls sub.yr_by_yr)
do 
echo -e  "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
       ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done

The for loop works, but takes forever, hence the need for parallelization. The folder sub.yr_by_yr contains >10 files with names like sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt...

I've tried

parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt

But it gives me 'bad substitution'

How could I use the awk script in parallel and rename the files accordingly?

Content of subbeagle.awk:

# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk

BEGIN  { FS=OFS="\t" }                             # input/output fields are tab delimited
FNR==NR { headers[$1]; next }
        { sep=""
          for (i=1; i<=NF; i++) {
              if (FNR==1 && ($i in headers)) {
                 fldids[i]
              }
              if (i in fldids) {
                 printf "%s%s",sep,$i
                 sep=OFS                            # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
              }
          }
          print ""
        }

Content of MajorMinor.beagle.gz

marker      allele1  allele2  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID1_splitMerged  FINCH_WB_ID2_splitMerged  FINCH_WB_ID2_splitMerged
chr1_34273  G        C        0.79924                   0.20076                   3.18183e-09               0.940649                      0.0593509
chr1_34285  G        A        0.79924                   0.20076                   3.18183e-09               0.969347                      0.0306534
chr1_34291  G        C        0.666111                  0.333847                  4.20288e-05               0.969347                      0.0306534
chr1_34299  C        G        0.000251063               0.999498                  0.000251063               0.996035                      0.00396529

UPDATE:

I was able to get this far, adapting a command from another source:

parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt

The only thing left to remove is the .subbeagle part of the input file name: {/.} strips the directory and the last extension (.txt), but nothing more.
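To see why .subbeagle is left over, here is a minimal bash-only sketch that imitates what {/} and {/.} do to one of the file names from the question. It does not call parallel at all; the variable names (path, base, noext) are made up for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: mimic parallel's {/} and {/.} with bash parameter expansion.
path="sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt"

base=${path##*/}     # like {/}:  drop everything up to the last slash
noext=${base%.*}     # like {/.}: additionally drop the last extension only

echo "$base"
echo "$noext"
echo "MM.beagle.${noext}_test.gz"
```

The last line prints MM.beagle.sp.yrseries.site1.1.subbeagle_test.gz, so a second extension still has to be removed.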


Solution

  • So the parallel tutorial helped me here:

    parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
    

    Let's break this:

    --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'
    
    • --rpl will "define a shorthand replacement string" (see parallel tutorial and another example here)

    • {mymy} is the name of my 'new' replacement string; the Perl expression after it is applied to each input to produce the replacement text.

    • s:.*/::; is the definition of {/}, which removes the directory part of the path (see parallel tutorial, search for "Perl expression replacement string", the last part of that section shows the definition of 7 'default' replacement strings)

    • s:\.[^.]+$::;s:\.[^.]+$::; removes 2 extensions (so .subbeagle.txt where .txt is the first extension and .subbeagle is the second)

    • The quoted command is the subsetting and compressing part of the script:

      "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"

      {} is replaced by the input file path as given, and {mymy} by the trimmed name computed by the --rpl expression. The rest is unchanged from the for loop!

    • ::: sub.yr_by_yr/*.subbeagle.txt will pass all the files to parallel as input.

    The for loop took ~2 hours to get through about 5 files, but using 22 cores, I could process all the files in a fraction of that time (~20 minutes)!
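As a possible alternative, the bash suffix removal from the original loop can be kept by wrapping the work in an exported function. The first attempt failed with 'bad substitution' because the double quotes let the calling shell expand ${{}%.subbeagle.txt} before parallel ever ran, and {} is not a valid variable name. The sketch below avoids that by passing the file name as $1. The function names out_name and subset_one are hypothetical, and this variant was not part of the original solution; it assumes bash is the shell parallel starts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the output path for one input file,
# using the same suffix removal as the original for loop.
out_name() {
    local base=${1##*/}                 # drop the directory part
    local stem=${base%.subbeagle.txt}   # drop the known suffix
    printf 'sub.yr_by_yr_beagle.files/MM.beagle.%s.gz\n' "$stem"
}

# Hypothetical wrapper around the pipeline from the question.
subset_one() {
    awk -f subbeagle.awk "$1" <(zcat ../MajorMinor.beagle.gz) | gzip > "$(out_name "$1")"
}

# Exported functions are visible to the bash shells that parallel starts.
export -f out_name subset_one

# With GNU parallel installed, this would run one job per input file:
#   parallel subset_one ::: sub.yr_by_yr/*.subbeagle.txt

# Quick check of the naming logic alone (no awk or parallel needed):
out_name sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt
```

The final line prints sub.yr_by_yr_beagle.files/MM.beagle.sp.yrseries.site1.1.gz. Unlike the --rpl version, this only strips the exact suffix .subbeagle.txt.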