I want to parse a FASTQ file with a regular expression, block by block. A FASTQ file looks like this:
@EAS54_6_R1_2_1_413_324 // seqname
CCCTTCTTGTCTTCAGCGTTTCTCC // seq
+ // seqname #2
;;3;;;;;;;;;;;;7;;;;;;;88 // qual
@EAS54_6_R1_2_1_540_792 // seqname
TTGGCAGGCCAAGGCCGATGGATCA // seq
+ // seqname #2
;;;;;;;;;;;7;;;;;-;;;3;83 // qual
@EAS54_6_R1_2_1_443_348 // seqname
GTTGCTTCTGGCGTGGGTGGGGGGG // seq
+EAS54_6_R1_2_1_443_348 // seqname #2
;;;;;;;;;;;9;7;;.7;393333 // qual
And its format:
<fastq> := <block>+
<block> := @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n
<seqname> := [A-Za-z0-9_.:-]+
<seq> := [A-Za-z\n\.~]+
<qual> := [!-~\n]+
The problem is that I can't detect the end of a block (or the start of the next block), because the @ character can also appear in the <qual> block. However, the <qual> block has to be the same size as the <seq> block.
The question: Is it possible to write a regular expression with one group size limited to another group size?
Like this one (except \2.size token):
(?:@([A-Za-z0-9_\.:-]+)\n([A-Za-z\n\.~]+)\n\+([A-Za-z0-9_.:-]*)\n([!-~\n]{\2.size}))*
^.....seqname.....^ ^.....seq......^ ^....seqname2....^ ^qual(should be same size as seq)^
UPDATE: We can't search for the @ token because it can appear in the <qual> block.
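Unfortunately, it's not possible to match a{n}b{n} (n a's followed by exactly n b's, for arbitrary n) with a regular expression in the formal sense. It requires a context-free grammar; the standard proof uses the pumping lemma for regular languages.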
(Instead, just match name, seq, and +, then get match.size() of the seq match; call it n. Then read the next n characters from the remaining string to get qual.)
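A minimal sketch of that approach in Python, assuming the whole file is already in memory as one string and that the quality string is wrapped exactly like the sequence (so the character count, newlines included, lines up). The names HEADER and parse_fastq are made up for illustration; the character classes are copied from the grammar in the question.

```python
import re

# Matches everything up to (but not including) the quality string.
HEADER = re.compile(
    r"@([A-Za-z0-9_.:-]+)\n"     # @<seqname>
    r"([A-Za-z\n.~]+)\n"         # <seq>, possibly spanning several lines
    r"\+([A-Za-z0-9_.:-]*)\n"    # +[<seqname>]
)


def parse_fastq(text):
    """Yield (name, seq, qual) tuples from a FASTQ string."""
    pos = 0
    while pos < len(text):
        m = HEADER.match(text, pos)
        if m is None:
            raise ValueError(f"malformed block at offset {pos}")
        name, seq = m.group(1), m.group(2)
        n = len(seq)
        # Read exactly n characters of quality data instead of asking
        # the regex to find where the quality string ends.
        qual = text[m.end():m.end() + n]
        if len(qual) != n:
            raise ValueError(f"truncated quality string for {name}")
        yield name, seq, qual
        pos = m.end() + n
        # Skip the newline that terminates the block, if present.
        if text.startswith("\n", pos):
            pos += 1
```

Because the quality string is sliced by length rather than matched, a leading @ in it is never mistaken for the start of the next block.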