
Use awk to find regex of variable length and edit following lines based on length found


I am trying to edit a fastq-file with awk.

@someheader example fastq file
TGTACTTAGAGAAGCGC
+
BDDADHHIHHHIICHIG
@nextheader
CCGTAACCTGGGCAGTG
+
DDDDDHIIIIIIIIIII

What I want to achieve is:

  • look for the following regex: /AGATCGGAAG[ATGC]{0,24}$/ - if possible only in lines where it can actually occur (the sequence lines 2, 6, 10, ..., i.e. line numbers x where x % 4 == 2)
  • If found, remove the match
  • Then remove the same number of characters at the end 2 lines after the current line

So far, editing one line based on the regex was no problem for me; I used:

awk '{ gsub(/AGATCGGAAG[ATGC]{0,24}$/, ""); print RLENGTH }'

But I have no idea how I can delete the same number of characters 2 lines later. I am very inexperienced and have only just started learning awk, so any help is welcome.

greetings

EDIT: Here's an example containing the pattern above:

@HWI-ST558:329:H3K2GBCXX:1:1101:5408:2985 1:N:0:ATCACG
CCTCCCGGTCGGTGCTGAGAGAGACTGGGCTCTCTGGAACTCCACCACCGAGATCGGAAGAG
+
HHHIIIIHDHIIIHIIGHHHIHFHHCHHIE?GHHGHF?GECFEEHFHHHCHDHHHFEEHHHH

This should be the output:

@HWI-ST558:329:H3K2GBCXX:1:1101:5408:2985 1:N:0:ATCACG
CCTCCCGGTCGGTGCTGAGAGAGACTGGGCTCTCTGGAACTCCACCACCG
+
HHHIIIIHDHIIIHIIGHHHIHFHHCHHIE?GHHGHF?GECFEEHFHHHC

The file contains 40 million of these entries, with ~250k containing the pattern.


Solution

  • This might work but since your sample input doesn't contain any lines that'd match the regexp and you didn't provide any expected output, of course it's untested:

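    # Sequence line (2nd of each 4-line record): look for the adapter; RSTART is 0 if absent
    NR%4 == 2 { match($0,/AGATCGGAAG[ATGC]{0,24}$/) }
    # Sequence and quality lines: cut at the match position; RSTART carries over to the quality line
    RSTART && (NR%4 ~ /^[02]$/) { $0 = substr($0,1,RSTART-1) }
    { print }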
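To try the idea on the example record from the question's edit, here is a minimal sketch wrapped in a shell function. The function name `trim_adapter` and the variable names `off` and `cut` are made up for illustration. One deliberate change from the script above: instead of the `{0,24}` interval expression, which some awk builds (older mawk, or older gawk without `--re-interval`) may not accept, it searches only the last 34 characters of the sequence line (10 adapter bases plus at most 24 trailing bases). That should select the same matches, but treat it as an assumption to check against your own data rather than a drop-in guarantee.

```shell
# trim_adapter: hypothetical helper; reads FASTQ on stdin, writes trimmed FASTQ to stdout.
trim_adapter() {
    awk '
        NR % 4 == 2 {
            # Look only at the last 34 characters: 10 adapter bases plus
            # at most 24 trailing bases (stands in for the {0,24} interval).
            off = length($0) - 34
            if (off < 0) off = 0
            if (match(substr($0, off + 1), /AGATCGGAAG[ATGC]*$/))
                cut = off + RSTART   # adapter start position in the full line
            else
                cut = 0              # no adapter in this record
        }
        # Truncate the sequence line and, two lines later, the quality line.
        cut && (NR % 4 == 2 || NR % 4 == 0) { $0 = substr($0, 1, cut - 1) }
        { print }
    '
}

# Example record taken from the question.
printf '%s\n' \
    '@HWI-ST558:329:H3K2GBCXX:1:1101:5408:2985 1:N:0:ATCACG' \
    'CCTCCCGGTCGGTGCTGAGAGAGACTGGGCTCTCTGGAACTCCACCACCGAGATCGGAAGAG' \
    '+' \
    'HHHIIIIHDHIIIHIIGHHHIHFHHCHHIE?GHHGHF?GECFEEHFHHHCHDHHHFEEHHHH' \
    | trim_adapter
```

On this record the sequence and quality lines should both come out 50 characters long, as in the expected output given in the question. For a real file you would run something like `trim_adapter < in.fastq > out.fastq` (file names are placeholders). Records without the adapter pass through unchanged, and the whole file is processed in a single pass.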