Search code examples
bashawkfastq

awk NR==FNR for command syntax


I am having trouble using awk NR==FNR to return lines of interest from an input .fastq file.

I have the following example input file called example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

I am trying to extract groups of four lines that contain a string of interest, importantly approximate matches must be allowed hence the use of agrep instead of grep. The below example works.

agrep -1 -n "GAAATAATA" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq

The above command produces the following correct output.

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

However if I use a sequence not contained in the second line this command still prints the top two lines as in the following example.

agrep -1 -n "TAGATAAAACT" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

Thanks for helping me understand the behavior of this awk command.


Solution

  • You may use this agrep + awk solution:

    srch() {
       awk -F ': ' 'NR==FNR {
          a[$1] = 1
          next
       }
       a[FNR] {
          print p
          print
          for (i=0; i<2 && getline > 0; i++)
             print
       }
       {
          p=$0
       }' <(agrep -1 -n "$2" "$1") "$1"
    }
    

    Then run it as:

    srch file 'GAAATAATA'
    

    @SRR1111111.1 1/1
    CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
    +
    AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
    @SRR1111111.3 3/1
    CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
    +
    AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
    

    and this:

    srch file 'TAGATAAAACT
    

    @SRR1111111.3 3/1
    CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
    +
    AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6'