
Conflict between GNU parallel and awk (split a column and filter some rows)


I am working with many large gz files like the example below (only the first 5 rows are shown here).

gene_id variant_id  tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se
ENSG00000223972.4   1_13417_C_CGAGA_b37 1548    50  50  0.0766871   0.735446    -0.0468165  0.138428
ENSG00000223972.4   1_17559_G_C_b37 5690    7   7   0.00964187  0.39765 -0.287573   0.339508
ENSG00000223972.4   1_54421_A_G_b37 42552   28  28  0.039548    0.680357    0.0741142   0.179725
ENSG00000223972.4   1_54490_G_A_b37 42621   112 120 0.176471    0.00824733  0.247533    0.093081

Here, I split the second column by "_" and select rows based on the second and third columns after splitting ($2==1 and $3>20000), then save the result as a text file. The command below works perfectly.

zcat InputData.txt.gz | awk -F "_"  '$1=$1' | awk '{if ($2==1 && $3>20000) {print}}'  > OutputData.txt

Below is the output that I want.

ENSG00000223972.4   1 54421 A G b37 42552   28  28  0.039548    0.680357    0.0741142   0.179725
ENSG00000223972.4   1 54490 G A b37 42621   112 120 0.176471    0.00824733  0.247533    0.093081

But I want to use GNU parallel to speed up the process, since I have many large gz files to work with. However, there seems to be some conflict between GNU parallel and awk, probably related to quoting?

I tried defining the awk option separately as below, but it did not give me anything in the output file.

In the command below, I am only running parallel on one input file. But I want to run it on multiple input files and save multiple output files, each corresponding to one input file.

For example,

InputData_1.txt.gz to OutputData_1.txt

InputData_2.txt.gz to OutputData_2.txt

awk1='{ -F "_"  "$1=$1" }'
awk2='{if ($2==1 && $3>20000) {print}}' 
parallel "zcat {} | awk '$awk1' |awk '$awk2' > OutputData.txt" ::: InputData.txt.gz

Does anyone have any suggestion on this task? Thank you very much.
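
A side note on why the attempt above probably writes an empty file: quoting may not be the main culprit. Because $awk1 is expanded inside the double-quoted string, awk receives the text { -F "_"  "$1=$1" } as its program. There, -F is no longer a command-line option; it is most likely parsed as an expression (minus an unset variable F, concatenated with two strings) that prints nothing. The sketch below checks this on one sample row without parallel; the variable names row, broken and fixed are made up for illustration.

```shell
# One sample row from the question, with single spaces between columns.
row='ENSG00000223972.4 1_54421_A_G_b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725'

# As in the question: -F sits inside the program text, so awk treats it
# as an expression statement and never prints the line.
broken=$(printf '%s\n' "$row" | awk '{ -F "_"  "$1=$1" }')

# -F given as an option: the line is split on "_" and, because $1=$1
# rebuilds the record, the fields are re-joined with spaces and printed.
fixed=$(printf '%s\n' "$row" | awk -F "_" '$1=$1')

printf 'broken: [%s]\n' "$broken"
printf 'fixed:  [%s]\n' "$fixed"
```

If this reading is right, keeping -F "_" outside the quoted program (or calling a function, as suggested in the answer) should be enough to get output again.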


According to the suggestion from @karakfa, this is one solution

chr=1
RegionStart=10000
RegionEnd=50000
zcat InputData.txt.gz | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("OutputData.txt")}' 

#This also works using parallel

awkbody='{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("{}_OutputData.txt")}'
parallel "zcat {} | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '$awkbody' " ::: InputData_*.txt.gz

The output file name for the input file InputData_1.txt.gz will be InputData_1.txt.gz_OutputData.txt
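
If the OutputData_1.txt naming from the question is preferred, one option is to derive the name with shell parameter expansion inside a small helper. The following is only a sketch under some assumptions: filter_one is a hypothetical name, gzip -dc stands in for zcat, the input is a tiny stand-in built from rows in the question, and the file names are assumed to start with "Input" and carry no directory part.

```shell
# Hypothetical helper: filter one file and name the output after the input.
filter_one() {
  infile=$1
  base=${infile%.gz}              # InputData_1.txt.gz -> InputData_1.txt
  outfile=Output${base#Input}     # InputData_1.txt    -> OutputData_1.txt
  gzip -dc "$infile" |
    awk -v chr="$chr" -v start="$RegionStart" -v end="$RegionEnd" '
      { split($2, NewDF, "_") }
      NewDF[1] == chr && NewDF[2] > start && NewDF[2] < end {
        gsub("_", " ", $2)
        print
      }' > "$outfile"
}

chr=1
RegionStart=10000
RegionEnd=50000

# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Tiny stand-in input (assumption: real files have more columns and rows).
printf '%s\n' \
  'gene_id variant_id tss_distance' \
  'ENSG00000223972.4 1_13417_C_CGAGA_b37 1548' \
  'ENSG00000223972.4 1_54421_A_G_b37 42552' | gzip -c > InputData_1.txt.gz

filter_one InputData_1.txt.gz
cat OutputData_1.txt
```

In bash, the helper could then be handed to GNU parallel after export -f filter_one and export chr RegionStart RegionEnd, for example as parallel filter_one ::: InputData_*.txt.gz, since the function runs in a child shell that only sees exported names.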


Solution

  • https://www.gnu.org/software/parallel/man.html#QUOTING concludes:

    Conclusion: To avoid dealing with the quoting problems it may be easier just to write a small script or a function (remember to export -f the function) and have GNU parallel call that.

    So:

    doit() {
      zcat "$1" |
        awk -F "_"  '$1=$1' |
        awk '{if ($2==1 && $3>20000) {print}}'
    }
    export -f doit
    parallel 'doit {} > {=s/In/Out/; s/\.gz$//=}' ::: InputData*.txt.gz
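
The {= ... =} part in the last line is a Perl expression that GNU parallel applies to each input name to build the output name. To preview the names without running anything, the two substitutions can be emulated with sed, as in the sketch below. This is an emulation only (parallel itself evaluates Perl), preview_name is a made-up helper, and the .gz pattern is written in an escaped, anchored form so that only a trailing ".gz" is removed.

```shell
# Hypothetical helper: apply the same two substitutions to one file name.
preview_name() {
  printf '%s\n' "$1" | sed -e 's/In/Out/' -e 's/\.gz$//'
}

# Show the mapping for the two example names from the question.
for f in InputData_1.txt.gz InputData_2.txt.gz; do
  printf '%s -> %s\n' "$f" "$(preview_name "$f")"
done
```

GNU parallel's own --dry-run option offers a more direct check: it prints the commands, with the replacement strings filled in, instead of executing them.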