
Regex search block with size equal to previous block


I want to parse a FASTQ file block by block with a regular expression. A FASTQ file looks like this:

@EAS54_6_R1_2_1_413_324     // seqname
CCCTTCTTGTCTTCAGCGTTTCTCC   // seq
+                           // seqname #2
;;3;;;;;;;;;;;;7;;;;;;;88   // qual
@EAS54_6_R1_2_1_540_792     // seqname
TTGGCAGGCCAAGGCCGATGGATCA   // seq
+                           // seqname #2
;;;;;;;;;;;7;;;;;-;;;3;83   // qual
@EAS54_6_R1_2_1_443_348     // seqname
GTTGCTTCTGGCGTGGGTGGGGGGG   // seq
+EAS54_6_R1_2_1_443_348     // seqname #2
;;;;;;;;;;;9;7;;.7;393333   // qual

And its format:

<fastq>     :=  <block>+
<block>     :=  @<seqname>\n<seq>\n+[<seqname>]\n<qual>\n
<seqname>   :=  [A-Za-z0-9_.:-]+
<seq>       :=  [A-Za-z\n\.~]+
<qual>      :=  [!-~\n]+

The problem is that I can't detect the end of a block (or the start of the next one), because @ can also appear in the <qual> block. However, the <qual> block has to be the same size as the <seq> block.

The question: is it possible to write a regular expression in which the size of one group is limited to the size of another group?

Something like this, where \2.size is a made-up token meaning "the length of group 2":

(?:@([A-Za-z0-9_\.:-]+)\n([A-Za-z\n\.~]+)\n\+([A-Za-z0-9_.:-]*)\n([!-~\n]{\2.size}))*
    ^.....seqname.....^  ^.....seq......^    ^....seqname2....^  ^qual(should be same size as seq)^

UPDATE: We can't simply search for the @ token, because it can also appear in the <qual> block.


Solution

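  • Unfortunately, it's not possible to match a{n}b{n} with a regular expression. That language is not regular and requires at least a context-free grammar (the standard proof uses the pumping lemma for regular languages).

    Instead, use the regex to match only the name, seq, and + lines, take the length n of the seq match (match.size()), and then read the next n characters from the remaining string to get qual.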
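The approach above could look roughly like this in Python. This is only a sketch, not a reference FASTQ parser: the names HEAD and parse_fastq are made up for illustration, and it assumes the whole file fits in memory, uses \n line endings, and has none of the // annotations shown in the question. Because the grammar allows seq and qual to wrap across lines, the sketch counts bases and quality characters without newlines, rather than taking the raw match length.

```python
import re

# Header, sequence and '+' line, following the grammar in the question.
# '+' is not a valid sequence character, so the lazy sequence group
# stops at the first "\n+" separator.
HEAD = re.compile(
    r"@([A-Za-z0-9_.:-]+)\n"     # seqname
    r"([A-Za-z\n.~]+?)\n"        # seq, possibly wrapped over several lines
    r"\+([A-Za-z0-9_.:-]*)\n"    # '+' line with optional repeated seqname
)

def parse_fastq(text):
    """Yield (name, seq, qual) tuples from FASTQ text held in memory."""
    pos = 0
    while pos < len(text):
        m = HEAD.match(text, pos)
        if m is None:
            raise ValueError(f"malformed block at offset {pos}")
        seq = m.group(2).replace("\n", "")
        pos = m.end()

        # Read exactly len(seq) quality characters, skipping the newlines
        # introduced by line wrapping. A leading '@' is read as data here,
        # so it cannot be mistaken for the start of the next block.
        qual = []
        while len(qual) < len(seq):
            if pos >= len(text):
                raise ValueError("truncated quality string")
            ch = text[pos]
            pos += 1
            if ch != "\n":
                qual.append(ch)

        # Skip the newline(s) that end the block.
        while pos < len(text) and text[pos] == "\n":
            pos += 1

        yield m.group(1), seq, "".join(qual)
```

The regex only ever sees the part of the block where @ is unambiguous; the length check that a regular expression cannot express is done by the plain loop that follows it.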