Search code examples
awkfastq

efficient splitting of fastq files by length


I'm trying to find a less time consuming way of splitting fastq files by sequence length, i.e. splitting one big fastq file into multiple ones containing only sequences of the same length. Input is a normal fastq file (4 lines per sequence, with the actual sequence in the second line in every quartet) with varying sequence lengths:

@HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
NCCATAAAGTAGAAAGCACT
+
#00<FFFFFFFFFIIFIIFF
@HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
TGGAGAGAAAGGCAGTTCCTGA
+
BBBFFFFFFFFFFIIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
TCCTGTACTGAGCTGCCCCGA
+
BBBFFFFFFFFFFIIIIIIII
@HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
AAACCGTTACCATTACTGAGT
+
BBBFFFFFFFFFFIIIIFIII

Right now I'm using awk to filter out sequences of a specific length or within a specific range:

awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == 22) {print header, seq, qheader, qseq}}'

If I want to have an output file for every single sequence length, I manage with a for loop:

for i in {16..33};
awk -v var=$i 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) == var) {print header, seq, qheader, qseq}}'
done

The problem is, that although it works fine, it is rather time consuming, because I'm checking the whole file for each length separately I guess. Additionally I need to check for the longest and the shortest sequence beforehand.

Can anyone help me finding a more efficient solution than my loop? If possible a solution where I do not have to specify a range but one that checks for the minimal and maximum length and splits them automatically. I would like to do it in awk but I'm open for everything. Thanks Benedikt


Solution

  • something like this?

    $ awk        '{rec=rec sep $0; sep=ORS} 
           !(NR%4){print rec > fn; rec=sep=""} 
           NR%4==2{fn = length($0)".seq"}' file
    

    will generate these 3 files with contents

    ==> 20.seq <==
    @HISEQ:28:H8P69ADXX:1:1101:1462:2036 1:N:0:CTTGTA
    NCCATAAAGTAGAAAGCACT
    +
    #00<FFFFFFFFFIIFIIFF
    
    ==> 21.seq <==
    @HISEQ:28:H8P69ADXX:1:1101:1378:2223 1:N:0:CTTGTA
    TCCTGTACTGAGCTGCCCCGA
    +
    BBBFFFFFFFFFFIIIIIIII
    @HISEQ:28:H8P69ADXX:1:1101:1585:2081 1:N:0:CTTGTA
    AAACCGTTACCATTACTGAGT
    +
    BBBFFFFFFFFFFIIIIFIII
    
    ==> 22.seq <==
    @HISEQ:28:H8P69ADXX:1:1101:1419:2156 1:N:0:CTTGTA
    TGGAGAGAAAGGCAGTTCCTGA
    +
    BBBFFFFFFFFFFIIIIIIIII
    

    since there will be a handful of these output files, there is no need to explicitly close them.

    Explanation

    {rec=rec sep $0; sep=ORS} build the record line by line with ORS in between lines, with lazy initialization of the separator we can eliminate the dangling first separator.

    !(NR%4) if the line number is a multiple of 4

    {print rec > fn; rec=sep=""} print the record to the file and reset record and separator

    NR%4==2 if the line number is a 2 of 4.

    {fn = length($0)".seq"} set the filename