awk, pattern-matching, fastq

Match specific pattern and print just the matched string in the previous line


I have updated the question with additional information.

I have a .fastq file formatted in the following way:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+ 
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)

The format is the same for every sequence (a repeating block of 4 lines). What I am trying to do is search for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}) in a window of the first n=35 characters of the 2nd line, cut it out if found, and append it to the end of the previous line.

So far I have written some code that does almost what I want. I tried using the match function together with substr on my window of interest, but I didn't achieve my goal. My script.awk is below:

match(substr($0,0,35),/regexp/,a) {
    print p,a[0]            # print the previous line followed by the matched string
    print                   # print the current line
    for(i=0;i<=1;i++) {     # print the 2 lines that follow
        getline
        print
    }
}
# store the previous line
{ p = $0 }

Starting from a file like this:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. 
+ 
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..

I would like to obtain an output like this:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA # the string that matched the regexp, WITHOUT the initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC # without the initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line

Solution

  • $ cat tst.awk
    BEGIN {
        tgtStr   = "pattern"
        tgtLgth  = length(tgtStr)
        winLgth  = 35
        numLines = 4
    }
    {
        lineNr = ( (NR-1) % numLines ) + 1
        rec[lineNr] = $0
    }
    lineNr == numLines {
        if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
            rec[1] = rec[1] " " tgtStr
            rec[2] = substr(rec[2],idx+tgtLgth)
            rec[4] = substr(rec[4],idx+tgtLgth)
        }
        for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
            print rec[lineNr]
        }
    }
    
    $ awk -f tst.awk file
    @M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8  pattern
    ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
    +
    GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
    

    With respect to the code you posted:

    • substr($0,0,35) - strings, fields, line numbers, and awk-generated arrays start at 1, not 0, so that should be substr($0,1,35). Some awks compensate by treating the 0 as 1, but others count position 0 as part of the window and so return one character fewer (34 here). Get used to starting everything at 1 to avoid mistakes when it matters.
    • for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
    • getline - not an appropriate use here and syntactically fragile: its return value is never checked, so if the input ends early the loop just prints the last line again.
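
To check the indexing point from the first bullet on your own awk, here is a small sketch (the sample string and the function name substr_demo are made up for illustration). The start=1 line is portable; what the start=0 line prints can differ between awk implementations, so no output is claimed for it.

```shell
# Compare a 1-based and a 0-based start index on a short, made-up string.
substr_demo() {
    awk 'BEGIN {
        s = "ABCDEFGHIJ"                      # hypothetical sample string
        print "start=1:", substr(s, 1, 5)     # portable: the first 5 characters
        print "start=0:", substr(s, 0, 5)     # result depends on the awk in use
    }'
}

substr_demo
```

If the two lines differ on your system, that is the off-by-one that writing substr($0,1,35) avoids.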

    Update - per your comment below that pattern is actually a regexp rather than a string:

    $ cat tst.awk
    BEGIN {
        tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
        winLgth   = 35
        numLines  = 4
    }
    {
        lineNr = ( (NR-1) % numLines ) + 1
        rec[lineNr] = $0
    }
    lineNr == numLines {
        if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
            rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
            rec[2] = substr(rec[2],RSTART+RLENGTH)
            rec[4] = substr(rec[4],RSTART+RLENGTH)
        }
        for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
            print rec[lineNr]
        }
    }
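
A minimal sketch of what match() leaves in RSTART and RLENGTH when it is applied to a window that starts at position 1, as in the script above. The sample string, the window length of 8, the literal regexp /ACA/, and the function name match_demo are all made up for illustration. The real regexp uses interval expressions ({5,}), which some older awk versions accept only with an extra option, so check that against your awk.

```shell
# Show how RSTART and RLENGTH from a windowed match() map onto the full line.
match_demo() {
    awk 'BEGIN {
        line = "TTTTACAGGGGG"                 # hypothetical sequence line
        win  = 8                              # hypothetical window length
        if ( match(substr(line, 1, win), /ACA/) ) {
            # RSTART and RLENGTH are relative to the window. Because the
            # window starts at position 1, they are also valid offsets
            # into the full line.
            print RSTART, RLENGTH, substr(line, RSTART, RLENGTH), substr(line, RSTART + RLENGTH)
        }
    }'
}

match_demo
```

On this sample it prints 5 3 ACA GGGGG: the match starts at character 5, is 3 characters long, and everything after it is what would be kept for the sequence and quality lines. If the window started anywhere other than position 1, RSTART would need the window offset added before indexing into the full line.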