Search code examples
bashawksedbioinformaticsgff

How to print fields containing specific substrings with awk?


Goal: To print fields from an input file to an output file, but with special regard to specific fields split by semicolons (; — see field nine of example input below).

Example Input (input.txt):

NC_051336.1     Gnomon  gene    40042   56215   .       -       .       gene_id "LOC102552844"; transcript_id ""; db_xref "GeneID:102552844"; db_xref "RGD:7551986"; gbkey "Gene"; gene "LOC102552844"; gene_biotype "pseudogene"; pseudo "true";
NC_051336.1     BestRefSeq%2CGnomon     gene    76909   85762   .       +       .       gene_id "Vom2r3"; transcript_id ""; db_xref "GeneID:502213"; db_xref "RGD:1565892"; description "vomeronasal 2 receptor, 3"; gbkey "Gene"; gene "Vom2r3"; gene_biotype "protein_coding"; gene_synonym "RGD1565892";
NC_051336.1     Gnomon  gene    162525  192445  .       -       .       gene_id "LOC103693496"; transcript_id ""; db_xref "GeneID:103693496"; db_xref "RGD:9090555"; gbkey "Gene"; gene "LOC103693496"; gene_biotype "protein_coding";
NC_051336.1     Gnomon  transcript      162525  167758  .       -       .       gene_id "LOC103693496"; transcript_id "XM_039098304.1"; db_xref "GeneID:103693496"; gbkey "mRNA"; gene "LOC103693496"; model_evidence "Supporting evidence includes similarity to: 4 Proteins, and 90% coverage of the annotated genomic feature by RNAseq alignments"; product "sperm motility kinase X-like, transcript variant X9"; transcript_biotype "mRNA";
NC_051336.1     BestRefSeq      gene    226465  241532  .       -       .       gene_id "Vom2r4"; transcript_id ""; db_xref "GeneID:308248"; db_xref "RGD:1564110"; description "vomeronasal 2 receptor, 4"; gbkey "Gene"; gene "Vom2r4"; gene_biotype "protein_coding"; gene_synonym "RGD1564110";
NC_051336.1     BestRefSeq      transcript      226465  241532  .       -       .       gene_id "Vom2r4"; transcript_id "NM_001099458.1"; db_xref "GeneID:308248"; gbkey "mRNA"; gene "Vom2r4"; product "vomeronasal 2 receptor, 4"; transcript_biotype "mRNA";
NC_051336.1     Curated Genomic gene    267100  275769  .       -       .       gene_id "Vom2r-ps8"; transcript_id ""; db_xref "GeneID:502214"; db_xref "RGD:1563053"; description "vomeronasal 2 receptor, pseudogene 8"; gbkey "Gene"; gene "Vom2r-ps8"; gene_biotype "pseudogene"; gene_synonym "RGD1563053"; pseudo "true";
NC_051336.1     Gnomon  gene    284195  301267  .       -       .       gene_id "LOC102556157"; transcript_id ""; db_xref "GeneID:102556157"; db_xref "RGD:7626961"; gbkey "Gene"; gene "LOC102556157"; gene_biotype "protein_coding";
NC_051336.1     Gnomon  gene    307758  313908  .       -       .       gene_id "LOC108352169"; transcript_id ""; db_xref "GeneID:108352169"; db_xref "RGD:11507047"; gbkey "Gene"; gene "LOC108352169"; gene_biotype "lncRNA";
NC_051336.1     Gnomon  transcript      307758  313908  .       -       .       gene_id "LOC108352169"; transcript_id "XR_005497081.1"; db_xref "GeneID:108352169"; gbkey "ncRNA"; gene "LOC108352169"; model_evidence "Supporting evidence includes similarity to: 1 EST, and 98% coverage of the annotated genomic feature by RNAseq alignments, including 3 samples with support for all annotated introns"; product "uncharacterized LOC108352169, transcript variant X3"; transcript_biotype "lnc_RNA";

Desired output (output.txt):

LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  Vom2r3          NC_051336.1:76909-85762 NC_051336.1     76909   85762   +       8853    protein_coding
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920   protein_coding
Vom2r4  Vom2r4          NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   protein_coding
Vom2r-ps8       Vom2r-ps8       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669    pseudogene
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072   protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150    lncRNA

For only those lines of the input where $3 is "gene"...
   Column 1:  "gene_id" substring of $9
   Column 2:  "gene" substring of $9
   Column 3:  $1 ":" $4 "-" $5
   Column 4:  $1
   Column 5:  $4
   Column 6:  $5
   Column 7:  $7
   Column 8:  $5-$4
   Column 9:  "gene_biotype" substring of $9

Attempts:

1.
$ cat input.txt | awk '                                                           # provide the input to awk
  BEGIN{FS="\t"}                                                                  # use tab as the field separator
  {split($9,a,";");                                                               # split field 9 into individual fields using semicolon as the field separator
  if($3~"gene")                                                                   # if the third field is "gene"
  print a[1]"\t"a[6]"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"a[7]}' | # print the indicated fields of each line in a tab-delimited fashion
  sed 's/gene_id "//' |                                                           # remove any instance of the string gene_id "
  sed 's/gene "//' |                                                              # remove any instance of the string gene "
  sed 's/gene_biotype "//' |                                                      # remove any instance of the string gene_biotype "
  sed 's/[" ]//g' > output.txt                                                    # remove any instance of the character " globally and send output to file
$ cat output.txt
LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
Vom2r3  gbkeyGene       NC_051336.1:76909-85762 NC_051336.1     76909   85762   +       8853    Vom2r3
LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920   protein_coding
Vom2r4  gbkeyGene       NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   Vom2r4
Vom2r-ps8       gbkeyGene       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669    Vom2r-ps8
LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072   protein_coding
LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150    lncRNA

This comes really close to what I want... except that output lines 2,4,5 are incorrect. I want the second output field to be a substring of the "gene "" input field and the ninth output field to be a substring of the "gene_biotype "" input field (both desired input fields have variable positions in the semicolon separated series of the ninth field).

While printing a[1] to extract the "gene_id "" field does work since it is ALWAYS in the first field of the ninth semicolon-delimited input field, printing a[6] and a[7] to extract the other fields mentioned above will NOT work because their index position changes for some lines (see 2,4,5 of the output). The obvious solution is to print a field only when it contains the string of interest rather than a specific index position...

2.
$ cat $ANNOTATION | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}"\t"{for(i=1;i<=NF;i++){if($i~/gene "/){print a[$i]}}}"\t"$1":"$4"-"$5"\t"$1"\t"$4"\t"$5"\t"$7"\t"$5-$4"\t"{for(i=1;i<=NF;i++){if($i~/gene_biotype "/){print a[$i]}}}}' | sed 's/gene_id "//' | sed 's/gene "//' | sed 's/gene_biotype "//' | sed 's/[" ]//g' > output.txt
$ cat output.txt

This awk command is riddled with syntax errors since using { or } or ; after print is not acceptable, thus the command is not executed and there is no output.
The general idea was to replace the index positions with awk-friendly regex code to extract fields containing specific substrings: {for(i=1;i<=NF;i++){if($i~/gene_id "/){print a[$i]}}}. However, this will not work if the syntax is improper, like above.


Note: I am using...

  • Shell version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
  • AWK version: GNU Awk 4.0.2
  • sed version: sed (GNU sed) 4.2.2

Solution

  • Assumptions:

    • the last/9th field always contains entries for gene, gene_id and gene_biotype (otherwise this answer is going to print an empty string where the missing entry would show up in the output)

    One awk idea:

    awk '
    BEGIN        { FS=OFS="\t" }
    $3 == "gene" { delete attr                    # reset attr[] array
                   split($NF,a,";")               # split last field on ";"
                   for (i in a) {                 # loop through array
                       split(a[i],b,"\"")         # split array entry on double quote; result: b[1]==attribute name, b[2]==attribute value; we do not care about b[3]
                       gsub(/ /,"",b[1])          # strip spaces from the attribute name
                       attr[b[1]]=b[2]            # populate attr[] array
                   }
                   print attr["gene_id"],attr["gene"],$1":"$4"-"$5,$1,$4,$5,$7,($5-$4),attr["gene_biotype"]
                 }
    ' input.tsv
    

    This generates:

    LOC102552844    LOC102552844    NC_051336.1:40042-56215 NC_051336.1     40042   56215   -       16173   pseudogene
    Vom2r3  Vom2r3  NC_051336.1:76909-85762 NC_051336.1     76909   85762   +       8853    protein_coding
    LOC103693496    LOC103693496    NC_051336.1:162525-192445       NC_051336.1     162525  192445  -       29920 protein_coding
    Vom2r4  Vom2r4  NC_051336.1:226465-241532       NC_051336.1     226465  241532  -       15067   protein_coding
    Vom2r-ps8       Vom2r-ps8       NC_051336.1:267100-275769       NC_051336.1     267100  275769  -       8669  pseudogene
    LOC102556157    LOC102556157    NC_051336.1:284195-301267       NC_051336.1     284195  301267  -       17072 protein_coding
    LOC108352169    LOC108352169    NC_051336.1:307758-313908       NC_051336.1     307758  313908  -       6150  lncRNA