I'm trying to find a less time consuming way of splitting fastq files by sequence length, i.e. splitting one big fastq file into multiple ones containing only sequences of the same length. Input is a normal fastq file (4 lines per sequence, with the actual sequence in the second line in every quartet) with varying sequence lengths:
@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII
Right now I'm using awk to filter out sequences of a specific length or within a specific range:
awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == 22) {print header, seq, qheader, qseq}}'
If I want to have an output file for every single sequence length, I manage with a for loop:
for i in {16..33};
awk -v var=$i 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == var) {print header, seq, qheader, qseq}}'
done
The problem is, that although it works fine, it is rather time consuming, because I'm checking the whole file for each length separately I guess. Additionally I need to check for the longest and the shortest sequence beforehand.
Can anyone help me finding a more efficient solution than my loop? If possible a solution where I do not have to specify a range but one that checks for the minimal and maximum length and splits them automatically. I would like to do it in awk but I'm open for everything. Thanks Benedikt
something like this?
$ awk '{rec=rec sep $0; sep=ORS}
!(NR%4){print rec > fn; rec=sep=""}
NR%4==2{fn = length($0)".seq"}' file
will generate these 3 files with contents
==> 20.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF
==> 21.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII
==> 22.seq <==
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII
since there will be a handful of these output files, there is no need to explicitly close them.
Explanation
{rec=rec sep $0; sep=ORS}
build the record line by line with ORS in between lines, with lazy initialization of the separator we can eliminate the dangling first separator.
!(NR%4)
if the line number is a multiple of 4
{print rec > fn; rec=sep=""}
print the record to the file and reset record and separator
NR%4==2
if the line number is a 2 of 4.
{fn = length($0)".seq"}
set the filename