Tags: bash, unix, bioinformatics, vcf-variant-call-format

Replace numeric genotype code with DNA letter


How can I replace the numeric genotype code with a DNA letter? I have a modified VCF file that looks like this:

POS REF ALT A2.bam C10.bam
448 T C 0/0:0,255,255 0/0:0,255,255
2402 C T 1/1:209,23,0 xxx:255,0,255
n...

I want to replace 0/0 with the REF letter and 1/1 with the ALT letter, and delete the rest of the string after the genotype. It should look like this:

POS REF ALT A2.bam C10.bam
448 T C T T
2402 C G G xxx
n...

I have been trying to do it with sed, but it didn't work, and I don't know how to approach it.


Solution

  • Would you please try:

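    awk '{
        if (NR > 1) {                        # skip the header line
            for (i=4; i<=5; i++) {           # sample columns: A2.bam and C10.bam
                split($i, a, ":")            # a[1] is the genotype before the first ":"
                $i = a[1]
                if ($i == "0/0") $i = $2     # homozygous reference: use REF
                if ($i == "1/1") $i = $3     # homozygous alternate: use ALT
            }
        }
        print
    }' file.txt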
    

    Output:

    POS  REF ALT     A2.bam C10.bam
    448 T C T T
    2402 C T T xxx
    n...    
    
    • The for loop processes the 4th and 5th columns (A2.bam and C10.bam).
    • First it chops off the substring after ":".
    • If the remaining value is equal to "0/0", then replace it with the 2nd column (REF).
    • In case of "1/1", use the 3rd column (ALT).
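
If the real file has more than two sample columns, the same steps could be applied to every column from the 4th to the last instead of hard-coding 4 and 5. The following is only a minimal sketch under that assumption: `decode_genotypes` is a hypothetical helper name, the input is assumed to be whitespace-separated with a single header line, and any genotype other than 0/0 or 1/1 (for example 0/1) is left as it is, just like in the command above.

```shell
# Hypothetical helper: same substitution as above, for all sample columns (4..NF).
decode_genotypes() {
    awk '
    NR > 1 {                                   # leave the header line untouched
        for (i = 4; i <= NF; i++) {
            sub(/:.*/, "", $i)                 # drop everything from the first ":" on
            if ($i == "0/0") $i = $2           # homozygous reference: use REF
            else if ($i == "1/1") $i = $3      # homozygous alternate: use ALT
        }
    }
    { print }                                  # print every line, modified or not
    ' "$@"
}
```

It can be called as `decode_genotypes file.txt` or fed from a pipe. Modified lines are rejoined with single spaces; if the output should stay tab-separated, adding `-v OFS='\t'` to the awk call is one option.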

    Hope this helps.