I am working with many large gz files like the example below (only the first 5 rows are shown here).
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000223972.4 1_13417_C_CGAGA_b37 1548 50 50 0.0766871 0.735446 -0.0468165 0.138428
ENSG00000223972.4 1_17559_G_C_b37 5690 7 7 0.00964187 0.39765 -0.287573 0.339508
ENSG00000223972.4 1_54421_A_G_b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725
ENSG00000223972.4 1_54490_G_A_b37 42621 112 120 0.176471 0.00824733 0.247533 0.093081
Below is the output that I want.
Here, I split the second column on "_" and select rows based on the second and third columns after splitting ($2==1 and $3>20000), then save the result as a text file. The command below works perfectly.
zcat InputData.txt.gz | awk -F "_" '$1=$1' | awk '{if ($2==1 && $3>20000) {print}}' > OutputData.txt
ENSG00000223972.4 1 54421 A G b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725
ENSG00000223972.4 1 54490 G A b37 42621 112 120 0.176471 0.00824733 0.247533 0.093081
But I want to use GNU parallel to speed up the process, since I have many large gz files to work with. However, there seems to be some conflict between GNU parallel and awk, probably related to quoting?
I tried defining the awk option separately as below, but it did not give me anything in the output file.
In the command below, I am running parallel on only one input file. But I want to run it on multiple input files and save one output file per input file.
For example,
InputData_1.txt.gz to OutputData_1.txt
InputData_2.txt.gz to OutputData_2.txt
awk1='{ -F "_" "$1=$1" }'
awk2='{if ($2==1 && $3>20000) {print}}'
parallel "zcat {} | awk '$awk1' |awk '$awk2' > OutputData.txt" ::: InputData.txt.gz
Does anyone have any suggestion on this task? Thank you very much.
Following the suggestion from @karakfa, this is one solution:
chr=1
RegionStart=10000
RegionEnd=50000
zcat InputData.txt.gz | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("OutputData.txt")}'
#This also works using parallel
awkbody='{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("{}_OutputData.txt")}'
parallel "zcat {} | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '$awkbody' " ::: InputData_*.txt.gz
The output file name for the input file InputData_1.txt.gz will be InputData_1.txt.gz_OutputData.txt.
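If the .gz in the middle of that name is unwanted, the output name could instead be derived from the input name with plain shell parameter expansion. This is only a minimal sketch: outname is a hypothetical helper (not part of the solution above), and it assumes the file names start with "Input", end in ".gz", and carry no directory prefix.

```shell
# Hypothetical helper: map InputData_N.txt.gz to OutputData_N.txt.
# Assumes the name starts with "Input" and has no directory part.
outname() {
  base=${1%.gz}                         # strip a trailing ".gz"
  printf 'Output%s\n' "${base#Input}"   # swap the leading "Input" for "Output"
}

outname InputData_1.txt.gz   # prints OutputData_1.txt
```

Within GNU parallel itself, the {.} replacement string drops the last extension, so "{.}_OutputData.txt" would give InputData_1.txt_OutputData.txt.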
https://www.gnu.org/software/parallel/man.html#QUOTING concludes:
Conclusion: To avoid dealing with the quoting problems it may be easier just to write a small script or a function (remember to export -f the function) and have GNU parallel call that.
So:
doit() {
zcat "$1" |
awk -F "_" '$1=$1' |
awk '{if ($2==1 && $3>20000) {print}}'
}
export -f doit
parallel 'doit {} > {=s/In/Out/; s/\.gz$//=}' ::: InputData*.txt.gz
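Before launching this over many large files, it may help to check the function body on a tiny sample without parallel. The sketch below is one way to do that, not part of the answer above: filter_one and the sample file are hypothetical, the two awk steps are merged into a single awk call using split() as in the other solution, and gzip -dc stands in for zcat.

```shell
# Build a tiny gzip sample from rows shown in the question (hypothetical file).
tmp=$(mktemp -d)
printf '%s\n' \
  'gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se' \
  'ENSG00000223972.4 1_13417_C_CGAGA_b37 1548 50 50 0.0766871 0.735446 -0.0468165 0.138428' \
  'ENSG00000223972.4 1_54421_A_G_b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725' |
  gzip -c > "$tmp/sample.txt.gz"

# Single-awk variant of doit (hypothetical name): split column 2 on "_",
# keep rows on chromosome 1 with position > 20000, and print them with the
# underscores in column 2 replaced by spaces.
filter_one() {
  gzip -dc "$1" |
    awk '{ split($2, v, "_") }
         v[1] == 1 && v[2] > 20000 { gsub("_", " ", $2); print }'
}

filter_one "$tmp/sample.txt.gz"
```

With this sample, only the 1_54421_A_G_b37 row should be printed. If it behaves as expected, the function could be exported with export -f and called from parallel in the same way as doit.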