efficient splitting of fastq files by length

I'm trying to find a less time consuming way of splitting fastq files by sequence length, i.e. splitting one big fastq file into multiple ones containing only sequences of the same length. Input is a normal fastq file (4 lines per sequence, with the actual sequence in the second line in every quartet) with varying sequence lengths:

@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII

Right now I'm using awk to filter out sequences of a specific length or within a specific range:

awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == 22) {print header, seq, qheader, qseq}}'

If I want to have an output file for every single sequence length, I manage with a for loop:

for i in {16..33};
awk -v var=$i 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == var) {print header, seq, qheader, qseq}}'
done

The problem is, that although it works fine, it is rather time consuming, because I'm checking the whole file for each length separately I guess. Additionally I need to check for the longest and the shortest sequence beforehand.

Can anyone help me finding a more efficient solution than my loop? If possible a solution where I do not have to specify a range but one that checks for the minimal and maximum length and splits them automatically. I would like to do it in awk but I'm open for everything. Thanks Benedikt

Solution

something like this?

$ awk        '{rec=rec sep $0; sep=ORS} 
       !(NR%4){print rec > fn; rec=sep=""} 
       NR%4==2{fn = length($0)".seq"}' file

will generate these 3 files with contents

==> 20.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF

==> 21.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII

==> 22.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII

since there will be a handful of these output files, there is no need to explicitly close them.

Explanation

{rec=rec sep $0; sep=ORS} build the record line by line with ORS in between lines, with lazy initialization of the separator we can eliminate the dangling first separator.

!(NR%4) if the line number is a multiple of 4

{print rec > fn; rec=sep=""} print the record to the file and reset record and separator

NR%4==2 if the line number is a 2 of 4.

{fn = length($0)".seq"} set the filename