grep every fourth line in .fastq


I am working on a Linux machine using bash.

My question is, how can I skip lines in the query file using grep?

I am working with a large ~16 GB .fastq file named example.fastq, which has the following format.

example.fastq

@SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
@SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
@SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E

I need to extract lines containing strings of interest such as @SRR6750041.2 and @SRR6750041.5, stored in a bash array called IDarray, as well as the 3 lines following each match. The following grep loop allows me to do this:

for ID in "${IDarray[@]}"; do
    grep -F -A 3 "$ID " example.fastq
done

This correctly outputs the following.

@SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E

I am looking for ways to speed this process up. One way would be to reduce the number of lines grep has to search, either by restricting the search to lines beginning with @, or by skipping lines that cannot possibly contain a match such as @SRR6750041.1 (i.e. lines 2, 3, 4 and 6, 7, 8, etc.). Is there a way to do this using grep? Alternative methods are also welcome!


Solution

  • Here are some thoughts, with examples. For test purposes I created a mini version of your case: example_mini.fastq is 145 MB and IDarray has 999 elements (IDs of interest).

    Your version has this performance (more than 2 minutes in user space):

    $ time for i in "${arr[@]}"; do grep -A 3 "${i}" example_mini.fastq; done 1> out.txt
    real    3m16.310s
    user    2m9.645s
    sys     0m53.092s
    
    $ md5sum out.txt
    8f199a78465f561fff3cbe98ab792262  out.txt
    

    First upgrade: make grep stop after the first match with -m 1 (I am assuming each ID of interest is unique). This cuts the work roughly in half and takes approx 1 minute in user space:

    $ time for i in "${arr[@]}"; do grep -m 1 -A 3 "${i}" example_mini.fastq; done 1> out.txt
    real    1m19.325s
    user    0m55.844s
    sys     0m21.260s
    
    $ md5sum out.txt
    8f199a78465f561fff3cbe98ab792262  out.txt
    

    Both of these solutions scale linearly with the number of elements: they call grep n times on the huge file.
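
    For comparison, grep itself can also do the whole search in a single pass: the -f option reads all the patterns from a file, so the big file is scanned only once. A minimal sketch, assuming each ID occurs only on a header line; the trailing space guards against prefix matches such as @SRR6750041.1 versus @SRR6750041.10. Note that GNU grep prints -- separators between match groups (suppressed here with --no-group-separator) and that matches come out in file order rather than IDarray order, so the output may not be byte-identical to the loop version:

    $ for i in "${arr[@]}"; do echo "$i "; done > patterns.txt
    $ grep -F -f patterns.txt -A 3 --no-group-separator example_mini.fastq 1> out.txt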

    Now let's implement it in AWK in a single run. I export IDarray into an input file so everything can be processed in one pass: the script loads the big file into an associative array keyed by ID, then loops once through your array of IDs to look them up. This is the generic scenario, where you can define the regexp and the number of lines to print after a match. The complexity is a single run through the file plus N comparisons, which works out to a 2000% speed-up:

    $ for i in "${arr[@]}"; do echo "$i"; done > IDarray.txt
    $ time awk '
    (FNR==NR) && (linesafter-- > 0) { arr[interest]=arr[interest] RS $0; next; }  # big file: append the 3 lines after a header
    (FNR==NR) && /^@/ { interest=$1; arr[interest]=$0; linesafter=3; next; }      # big file: header starts a record keyed by ID
    (FNR!=NR) && arr[$1] { print(arr[$1]); }                                      # ID file: print the cached record for this ID
    ' example_mini.fastq IDarray.txt 1> out.txt
    real    0m7.044s
    user    0m6.628s
    sys     0m0.307s
    
    $ md5sum out.txt
    8f199a78465f561fff3cbe98ab792262  out.txt
    

    As in your title: if you can really confirm that every fourth line is an ID of interest and that the three lines after it are the ones to print, you can simplify to this and speed up by another 20%:

    $ for i in "${arr[@]}"; do echo "$i"; done > IDarray.txt
    $ time awk '
    (FNR==NR) && (FNR%4==1) { interest=$1; arr[interest]=$0; next; }  # big file: every 4th line is a header; key record by ID
    (FNR==NR) { arr[interest]=arr[interest] RS $0; next; }            # big file: append the other 3 lines to the current record
    (FNR!=NR) && arr[$1] { print(arr[$1]); }                          # ID file: print the cached record for this ID
    ' example_mini.fastq IDarray.txt 1> out.txt
    real    0m5.944s
    user    0m5.593s
    sys     0m0.242s
    
    $ md5sum out.txt
    8f199a78465f561fff3cbe98ab792262  out.txt
    

    On a 1.5 GB file with 999 elements to search, the time is:

    real    1m4.333s
    user    0m59.491s
    sys     0m3.460s
    

    So, per my predictions on my machine, your ~16 GB example with 10k elements would take approx 16 minutes in user space to process.
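
    One caveat of the AWK versions above: they hold essentially the whole big file in the associative array, so memory use grows with the file size. If that is a concern, the lookup can be inverted so that only the IDs are kept in memory. A minimal sketch under the same every-fourth-line assumption; it reads IDarray.txt first and then streams the big file, printing records in file order (so the output may not be byte-identical to the versions above):

    $ awk '
    (FNR==NR) { ids[$1]; next; }            # ID file: remember each wanted ID
    ((FNR-1)%4==0) { keep=($1 in ids); }    # big file: header line, decide if this record is wanted
    keep { print; }                         # print the header plus the 3 lines after it
    ' IDarray.txt example_mini.fastq 1> out.txt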