Search code examples
loopsfileawknumbersraw

Changing snps in 00, 11, 20 in a file to biallelic letter allele using another file which has the nucleotides as map file


I have a raw.txt file:

FID IID FA  MO  SEX PHENO   SNP1    SNP2    SNP3    SNP4
1   1   0   0   1   1   20  00  20  11
1   2   0   0   1   1   11  00  20  20
1   3   0   0   1   1   11  20  11  20
1   4   0   0   1   1   00  11  11  20

A snp.txt file:

1   SNP1    20  A   G
1   SNP2    45  T   C
1   SNP3    56  A   G
1   SNP4    80  C   G

My output file should look like this (after conversion of numbers to from column 7 to letters in raw.txt based on columns 4 and 5 in snp.txt):

FID IID FA  MO  SEX PHENO   SNP1    SNP2    SNP3    SNP4
1   1   0   0   1   1   AA  CC  AA  CG
1   2   0   0   1   1   AG  CC  AA  CC
1   3   0   0   1   1   AG  TT  AG  CC
1   4   0   0   1   1   GG  TC  AG  CC

Column 2 of file snp.txt are the headers for file raw.txt starting from column 7 (raw.txt). Columns 4 and 5 of file snp.txt represent minor and major alleles of snps at column 2. I want columns under SNP1,SNP2, SNP3 and SNP4 which are in the 0,1,2 format to be converted to ACGT format using columns 4 and 5 as the map.

The columns SNP1, SNP2,SNP3 and SNP4 of raw.txt represent 0,1 or 2 copies of minor allele (4th column of snp.txt file). Column 5 is the major allele. If SNP1 is 20 as shown in raw.txt, there are 2 copies of the minor allele, which according to snp.txt is A. Therefore 20 should change to AA (The 2 in 20 is a count of the minor allele A). SNP1 11 indicates that there is 1 copy of the minor allele. Therefore 11 should be AG. SNP1 00 indicates that there is no copy of the minor allele but only major alleles. Therefore 00 should be GG (2 copies of the letter in column 5) of file snp.txt.

In actual fact, I have over 65,000 snps which means there are that much columns for file raw.txt. I have the code below (a code I found on stackoverflow that I edited a bit :

awk 'NR==FNR {a[$2,20]=$4$4; a[$2,11]=$4$5; a[$2,"00"]=$5$5; next} $7~/^[0-2]/ {
     $7=a["SNP1",$7]; $8=a["SNP2",$8];9=a["SNP3",$9];$10=a["SNP4",$10]}1'
snp.txt raw.txt > output.txt

This does what I want if file raw.txt has only 4 snps. I do not know how to make this loop through the fields from column 7 of raw.txt when I have over 65,000 snps. I want a an code (preferably awk language) which can loop through numerous columns of raw.txt to change the snps in 00, 11, 20 format to bi-allelic letter formats. Thank you.


Solution

  • Your awk is good! Here is how to make it for variable number of snps.

    > cat tst.awk 
    NR==FNR {
        snp[$2 "20"] = $4 $4
        snp[$2 "11"] = $4 $5
        snp[$2 "00"] = $5 $5
        next
    }
    
    FNR==1 { # read the columns/snps
        for (i=7;i<=NF;i++) col[i] = $i
        print
        next
    }
    
    {
        for (i=7;i<=NF;i++) $i = snp[col[i] $i]
        print
    }
    

    Usage:

    > awk -f tst.awk snp.txt raw.txt 
    FID IID FA  MO  SEX PHENO   SNP1    SNP2    SNP3    SNP4
    1 1 0 0 1 1 AA CC AA CG
    1 2 0 0 1 1 AG CC AA CC
    1 3 0 0 1 1 AG TT AG CC
    1 4 0 0 1 1 GG TC AG CC
    

    The modification is that we read the header and save the snps, later we use them for the mapping. Both actions are done with a typical for loop, from the column we want to the last column (NF), the rest is what you are already doing, besides some clearer syntax.