I have a series of text files in a folder, sub.yr_by_yr, which I pass to a for loop; each file lists the header values used to subset a Beagle file. I want to parallelize this script, in which the subsetting by header value is done by my subbeagle.awk script. I use the name of each text file to name the output: I strip the suffix with bash pattern matching (file11=${file1%.subbeagle.txt}) to get the desired output name (MM.beagle.${file11}.gz). This is the current loop:
for file1 in $(ls sub.yr_by_yr)
do
    echo -e "Doing sub-samples \n $file1"
    file11=${file1%.subbeagle.txt}
    awk -f subbeagle.awk \
        ./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done
The for loop works, but it takes forever, hence the need for parallelization. The folder sub.yr_by_yr contains more than 10 files with names like sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt, ...
I've tried
parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt
But it gives me a 'bad substitution' error.
How could I use the awk script in parallel and rename the files accordingly?
Content of subbeagle.awk:
# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk
BEGIN { FS=OFS="\t" } # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
{
    sep=""
    for (i=1; i<=NF; i++) {
        if (FNR==1 && ($i in headers)) {
            fldids[i]
        }
        if (i in fldids) {
            printf "%s%s",sep,$i
            sep=OFS # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
        }
    }
    print ""
}
Content of MajorMinor.beagle.gz:
marker allele1 allele2 FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID2_splitMerged FINCH_WB_ID2_splitMerged
chr1_34273 G C 0.79924 0.20076 3.18183e-09 0.940649 0.0593509
chr1_34285 G A 0.79924 0.20076 3.18183e-09 0.969347 0.0306534
chr1_34291 G C 0.666111 0.333847 4.20288e-05 0.969347 0.0306534
chr1_34299 C G 0.000251063 0.999498 0.000251063 0.996035 0.00396529
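To make the expected result concrete, here is a toy run of the same awk logic on one row shaped like the sample above. It is only a sketch: the temporary file names and the two-line list of wanted headers are made up for illustration, and the data is assumed to be tab-delimited, as the FS setting in subbeagle.awk implies.

```shell
#!/usr/bin/env bash
# Toy run of the header-based subsetting. All file names are temporary and
# the "wanted" header list is a made-up example, not one of the real inputs.
tmp=$(mktemp -d)

# Header names to keep, one per line (plays the role of a *.subbeagle.txt file).
printf '%s\n' marker FINCH_WB_ID2_splitMerged > "$tmp/wanted.txt"

# A header line and one data row, tab-delimited, in the layout of the Beagle file.
printf 'marker\tallele1\tallele2\tFINCH_WB_ID1_splitMerged\tFINCH_WB_ID2_splitMerged\tFINCH_WB_ID2_splitMerged\n' > "$tmp/toy.beagle"
printf 'chr1_34273\tG\tC\t0.79924\t0.940649\t0.0593509\n' >> "$tmp/toy.beagle"

# Same logic as subbeagle.awk, inlined so the sketch is self-contained.
subset=$(awk 'BEGIN { FS = OFS = "\t" }
    FNR == NR { headers[$1]; next }      # first file: remember wanted names
    {
        sep = ""
        for (i = 1; i <= NF; i++) {
            if (FNR == 1 && ($i in headers)) fldids[i]   # mark matching columns
            if (i in fldids) { printf "%s%s", sep, $i; sep = OFS }
        }
        print ""
    }' "$tmp/wanted.txt" "$tmp/toy.beagle")

printf '%s\n' "$subset"
rm -rf "$tmp"
```

The result keeps the marker column and both FINCH_WB_ID2_splitMerged columns. Because Beagle header names repeat, every column that shares a matched name is kept.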
UPDATE:
I got the following from this source:
parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
The only remaining problem is that {/.} strips just one extension, so the .subbeagle part of the input file name still needs to be removed. The parallel tutorial helped me here:
parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
Let's break this down:
--rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'
--rpl will "define a shorthand replacement string" (see parallel tutorial and another example here)
{mymy} is my 'new' replacement string; parallel runs the Perl expressions that follow it on each input.
s:.*/::; is the definition of {/}, which removes the directory part of the path (see parallel tutorial, search for "Perl expression replacement string"; the last part of that section shows the definitions of the 7 'default' replacement strings)
s:\.[^.]+$::;s:\.[^.]+$::; removes 2 extensions (so .subbeagle.txt, where .txt is the first extension removed and .subbeagle is the second)
"awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"
is the subsetting and compressing part of the script. Note that {mymy} is where the replacement takes place. As you can see, {} is the input string. The rest is unchanged!
::: sub.yr_by_yr/*.subbeagle.txt passes all the files to parallel as input.
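To check what {mymy} turns one input path into without running parallel, the same three substitutions can be pushed through sed. This is only an illustration: parallel evaluates them as Perl, and sed -E is assumed to be available.

```shell
#!/usr/bin/env bash
# Apply the three substitutions behind {mymy} to one example path,
# using sed -E in place of Perl purely for illustration.
path="sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt"

# 1) drop the directory, 2) drop .txt, 3) drop .subbeagle
mymy=$(printf '%s\n' "$path" | sed -E 's:.*/::; s:\.[^.]+$::; s:\.[^.]+$::')

echo "$mymy"
echo "sub.yr_by_yr_beagle.files/MM.beagle.${mymy}.gz"
```

Adding --dry-run to the parallel command prints the commands it would run without executing them, which is a cheap way to confirm the generated output names before starting the real jobs.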
Run serially, it took ~2 hours to do roughly 5 files, but using 22 cores, I could process all the files in a fraction of the time (~20 minutes)!
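As for the 'bad substitution' error in my first attempt: as far as I can tell it comes from the quoting rather than from parallel. Inside double quotes, the calling shell tries to expand ${{}%.subbeagle.txt} before parallel ever sees the command, and {} is not a valid variable name. Suffix stripping needs a real variable, as in this minimal sketch (stem_of is a hypothetical helper name, not part of parallel):

```shell
#!/usr/bin/env bash
# stem_of is a hypothetical helper (not part of GNU parallel): it drops the
# directory and the .subbeagle.txt suffix from a single path.
stem_of() {
    local base="${1##*/}"                    # strip everything up to the last /
    printf '%s\n' "${base%.subbeagle.txt}"   # strip the known suffix
}

stem_of "sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt"
```

With parallel, the same expansion should survive if the command is single-quoted and the input is first assigned to a variable, for example f={/} followed by ${f%.subbeagle.txt} inside the quoted command, so that the shell started by parallel performs the expansion. Note also that {} already contains the sub.yr_by_yr/ prefix when the inputs come from sub.yr_by_yr/*.subbeagle.txt, so prepending ./sub.yr_by_yr/ again would point at a path that does not exist.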