
AWK to handle bed files


I would like to extract and split fields from BED files to generate a new BED file containing the rearranged data.

I would go from here:

1   15903   rs557514207 G   G,A RS=557514207;RSPOS=15903;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005150026000200;GENEINFO=WASH7P:653635;WGT=1;VC=DIV;ASP;VLD;G5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOPMED=0.30307084607543323,0.00039022680937818,0.69653892711518858
1   11012   rs544419019 C   G   RS=544419019;RSPOS=11012;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000020005150024000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP;VLD;G5;KGPhase3;CAF=0.9119,0.08806;COMMON=1
1   15903   rs557514207 G   G,C RS=557514207;RSPOS=15903;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005150026000200;GENEINFO=WASH7P:653635;WGT=1;VC=DIV;ASP;VLD;G5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOPMED=0.30307084607543323,0.00039022680937818,0.69653892711518858

To here:

1   15903   rs557514207 G   G   CAF=0.5589,.
1   15903   rs557514207 G   A   CAF=0.5589,0.4411
1   11012   rs544419019 C   G   CAF=0.9119,0.08806
1   15903   rs557514207 G   G   CAF=0.5589,.
1   15903   rs557514207 G   C   CAF=0.5589,0.4411

So I want to split column 5 on commas and emit one new line per value. Column 6 is a string of entries joined by semicolons, and I'm only interested in the CAF=value1,value2,... entry between the semicolons. Each new line should keep the first CAF value together with the CAF value that corresponds to that line's entry from column 5. In the example, splitting G,A gives two new lines, one for G and one for A, each with its own CAF= pair.


Solution

awk -F'\t' -v OFS='\t' '
  {
    # split column 6; CAF part starts from element 2
    split($6, c6, /^.*CAF=|,|;.*$/)

    # split column 5
    n=split($5, c5, /,/)

    # print initial columns and relevant parts of 5 and 6
    for (i=1; i<=n; i++)
      print $1,$2,$3,$4, c5[i], "CAF="c6[2]","c6[2+i]
  }
' infile >outfile
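
The command above relies on every record having a CAF= entry in column 6. If a record lacks one, the `^.*CAF=` part of the split pattern never matches and the array holds unrelated pieces of the string. Below is a minimal sketch of a more defensive variant, not a replacement for the answer. It assumes tab-separated input, as the original command does, and it assumes that skipping records without CAF= is acceptable. The function name `split_caf` and the variable names are hypothetical.

```shell
# Hypothetical helper: reads tab-separated records on stdin, writes to stdout.
# Usage: split_caf <infile >outfile
split_caf() {
  awk -F'\t' -v OFS='\t' '
    {
      # Prepend ";" so a CAF entry at the very start of column 6 is found too.
      info = ";" $6

      # Assumption: records without a CAF= entry are skipped entirely.
      if (!match(info, /;CAF=[^;]*/)) next

      # Keep only the values after ";CAF=" (5 characters).
      caf = substr(info, RSTART + 5, RLENGTH - 5)

      # freq[1] is the first CAF value; freq[1+i] pairs with allele[i].
      split(caf, freq, ",")
      n = split($5, allele, ",")

      for (i = 1; i <= n; i++)
        print $1, $2, $3, $4, allele[i], "CAF=" freq[1] "," freq[1 + i]
    }
  '
}
```

Using `match` with `RSTART` and `RLENGTH` isolates the CAF entry before splitting, so commas elsewhere in column 6 (for example in TOPMED=) cannot end up in the array.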