Search code examples
bashawkbioinformaticsvcf-variant-call-format

Outputting two different gsub results to two columns with AWK


I'm trying to optimize an awk code to generate a new columns into my VCF file. This is the input (minimal working example):

#CHROM  POS GENE    REF ALT QUAL    FILTER  INFO    FORMAT  1   2   3   4   5   6
chr14   33925904    EGLN3   A   T   .   .   CSQ=T|  GT:AQ:DP:AD:AB:GQ:PL    0/1:.:0:.:0,0:.:.   1/1:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.
chr14   33925905    EGLN3   G   C   .   .   CSQ=C|  GT:AQ:DP:AD:AB:GQ:PL    ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   0/0:.:0:.:0,0:.:.   1/0:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.
chr14   33925907    EGLN3   G   A   .   .   CSQ=A|  GT:AQ:DP:AD:AB:GQ:PL    ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   0/0:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.
chr14   33925918    EGLN3   T   G   .   .   CSQ=G|  GT:AQ:DP:AD:AB:GQ:PL    ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   1/0:.:0:.:0,0:.:.
chr14   33927014    EGLN3   A   G   .   .   CSQ=G|  GT:AQ:DP:AD:AB:GQ:PL    ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   1/0:.:0:.:0,0:.:.
chr14   33927025    EGLN3   AT  A   .   .   CSQ=-|  GT:AQ:DP:AD:AB:GQ:PL    ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.
chr14   33929079    EGLN3   G   A   .   .   CSQ=A|  GT:AQ:DP:AD:AB:GQ:PL    ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   ./.:.:0:.:0,0:.:.   0/0:.:0:.:0,0:.:.

I use this code:

paste <(bcftools view $MYVCF | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') \ #prints first columns from vcf - Important if file has a vcf header
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |\
    awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') \ #counts samples (1-6) that have 0/1 or 1/0
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |\
    awk 'BEGIN {print "nHomAlt"} {print gsub(/1\|1|1\/1/, "")}') \
    \ #counts samples with 1/1
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |\
    awk 'BEGIN {print "nHomRef"} {print gsub(/0\|0|0\/0/, "")}')   >> example3.txt

  Which does the following:

First, it prints the columns CHR POS ID REF ALT and INFO, then searches for GT values in samples (1-6) with specific 0/0 and prints the counts of samples which have this specific GT in a new column named nHomRef. Does the same to 1/0 (and combinations) and output to column named nHet and then finally does the same to 1/1 combinations and outputs to a new column named hHomRef

This is outputting:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    nHet    nHomAlt nHomRef
chr14   33925904        EGLN3   A       T       .       .       CSQ=T|  1       1       0
chr14   33925905        EGLN3   G       C       .       .       CSQ=C|  1       0       1
chr14   33925907        EGLN3   G       A       .       .       CSQ=A|  0       0       1
chr14   33925918        EGLN3   T       G       .       .       CSQ=G|  1       0       0
chr14   33927014        EGLN3   A       G       .       .       CSQ=G|  1       0       0
chr14   33927025        EGLN3   AT      A       .       .       CSQ=-|  0       0       0
chr14   33929079        EGLN3   G       A       .       .       CSQ=A|  0       0       1

I'm trying to optimize the code so not many subprocesses are created. To simplify I used only two columns

paste <(bcftools view "$MYVCF" | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk 'BEGIN {print "nHomAlt\tnHet"} {print gsub(/1\|1|1\/1/, "")} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') >> example3_test.txt

How can I output the gsub, inside awk, to the correct column? So, this snippet

awk 'BEGIN {print "nHomAlt\tnHet"} {print gsub(/1\|1|1\/1/, "")} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}'

To output the correct columns? I`m getting:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    nHomAlt nHet
chr14   33925904    EGLN3   A   T   .   .   CSQ=T|  1
chr14   33925905    EGLN3   G   C   .   .   CSQ=C|  1
chr14   33925907    EGLN3   G   A   .   .   CSQ=A|  0
chr14   33925918    EGLN3   T   G   .   .   CSQ=G|  1
chr14   33927014    EGLN3   A   G   .   .   CSQ=G|  0
chr14   33927025    EGLN3   AT  A   .   .   CSQ=-|  0
chr14   33929079    EGLN3   G   A   .   .   CSQ=A|  0
    1
    0
    1
    0
    0
    0
    0

PS: Using

paste <(bcftools view $MYVCF | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}')  <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet"} {
 nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, "");
 nHomAlt=gsub(/1\|1|1\/1/, "");
 print nHomAlt,nHet;
}')

Seems to give the correct output!


Solution

  • First, any time your goal includes "to optimize", your first effort should be to eliminate repeating any work... such as the repeats of

    bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk 
    

    Part of your problem is trying to push your data through paste when you already have it in awk. I think what you really need to do is remember that print is going to append an output record separator, or ORS, which by default is a newline character. You could use printf to output each field, but I think it's easier to just pass your fields as arguments. You can also set the OFS (output field separator) to delimit them with a tab by default.

    Working from your input example, which I stored in a file, you probably just need to manage your data collection and formatting. Here's one possible example:

    $: awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet"} {
     nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, "");
     nHomAlt=gsub(/1\|1|1\/1/, "");
     print nHomAlt,nHet;
    }' input
    nHomAlt nHet
    1       3
    0       3
    3       2
    0       1
    0       3
    0       0
    0       7
    0       0
    

    You said you wanted counts, and that's what gsub returns, so I assume that's why you used it.

    What you want probably looks something like this:

    awk 'BEGIN { OFS="\t"; print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tnHet\tnHomAlt\tnHomRef" } {
     nHet=gsub(/0[|/]1|1[|/]0/, "");
     nHomAlt=gsub(/1[|/]1/, "");
     nHomRef=gsub(/0[|/]0/, "");
     print $1,$2,$3,$4,$5,$6,$7,$8,nHet,nHomAlt,nHomRef;
    }' input # maybe input should be < <( <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" )
    

    ...which shouldn't need the paste or the repeated calls to awk.

    You obviously already know how to substitute your command in for the input. Hopefully that will get you close enough.