I am trying to edit a fastq-file with awk.
@someheader example fastq file
TGTACTTAGAGAAGCGC
+
BDDADHHIHHHIICHIG
@nextheader
CCGTAACCTGGGCAGTG
+
DDDDDHIIIIIIIIIII
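What I want to achieve is: when the sequence line (line 2 of an entry) matches a pattern, delete the match there and delete the same number of characters from the end of the quality line two lines below.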
So far, editing one line based on the regex has been no problem; I used:
awk '{ gsub(/AGATCGGAAG[ATGC]{0,24}$/, ""); print RLENGTH }'
But I have no idea how to delete the same number of characters two lines later. I am very inexperienced and have only just started learning awk, so any help is welcome.
greetings
EDIT: Here's an example containing the pattern above:
@HWI-ST558:329:H3K2GBCXX:1:1101:5408:2985 1:N:0:ATCACG
CCTCCCGGTCGGTGCTGAGAGAGACTGGGCTCTCTGGAACTCCACCACCGAGATCGGAAGAG
+
HHHIIIIHDHIIIHIIGHHHIHFHHCHHIE?GHHGHF?GECFEEHFHHHCHDHHHFEEHHHH
This should be the output:
@HWI-ST558:329:H3K2GBCXX:1:1101:5408:2985 1:N:0:ATCACG
CCTCCCGGTCGGTGCTGAGAGAGACTGGGCTCTCTGGAACTCCACCACCG
+
HHHIIIIHDHIIIHIIGHHHIHFHHCHHIE?GHHGHF?GECFEEHFHHHC
The file contains 40 million of these entries, of which ~250k contain the pattern.
This might work, but since your sample input doesn't contain any lines that would match the regexp and you didn't provide any expected output, it's untested:
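# Sequence line: locate the adapter (RSTART is 0 when there is no match).
NR%4 == 2 { match($0,/AGATCGGAAG[ATGC]{0,24}$/) }
# Sequence and quality lines: cut both at the same position.
RSTART && (NR%4 ~ /^[02]$/) { $0 = substr($0,1,RSTART-1) }
{ print }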
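One caveat: the {0,24} interval in that regexp is not supported by every awk (older mawk builds, and gawk before 4.0 unless run with --re-interval, may treat the braces literally). If that bites you, here is a rough, portable sketch of the same idea. It assumes strict 4-line records, and the function name trim_adapter is just something I made up for illustration. It searches only the last 34 characters of the sequence line (10 for AGATCGGAAG plus at most 24 trailing bases), so a plain * can stand in for the interval:

```shell
# Hypothetical helper: reads FASTQ on stdin, writes trimmed FASTQ on stdout.
trim_adapter() {
  awk '
    NR % 4 == 2 {
      cut = 0                              # reset for every record
      off = length($0) - 34                # start of the 34-character tail
      if (off < 0) off = 0
      # RSTART is relative to the tail, so shift it back by off
      if (match(substr($0, off + 1), /AGATCGGAAG[ATGC]*$/)) cut = off + RSTART
    }
    # cut the sequence line and the quality line at the same position
    cut && (NR % 4 == 2 || NR % 4 == 0) { $0 = substr($0, 1, cut - 1) }
    { print }
  '
}
```

You would call it as something like trim_adapter < in.fastq > out.fastq (both file names are placeholders). Lines 1 and 3 of each record pass through untouched because only NR%4 values 2 and 0 are ever modified.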