I have a fastq file (file.fastq
) of about 80GB in size which has a header line and three subsequent information lines. I need to match /1 and /2 in header lines matching the header information and put them in one file sorted by /1 and /2 (interleaved).
For example, I want to find @HS2000-1015_160:5:2114:14081:69256/2
for @HS2000-1015_160:5:2114:14081:69256/1
and if found, I want to stack them together (repeating this for other headers as well) and get the result as shown below.
I was thinking something like this, but couldn't figure out how to do this
cat file.fastq | grep -A3 --no-group-separator ...
file.fastq
@HS2000-1015_160:5:1114:14809:86636/2
GTTTGGAAAACAGAAGGGATTCGTCTGGAAAGGGCGGGAAACAAAGGACTGGAAGGGAGTGATGGAAAAAAACATGATGAAACTGATCTGGAAAGAGCCA
+
@B@FFFFFHGHHHIIJJJHIJJJJIIJJJIJJJIJJIHFDDEDDBDDDDDBDDDDDDDD2?CDDDDDCCDDDDDDDDCCD@CDDDCDDEDDCDDDDC9<A
@HS2000-1015_160:5:2114:14081:69256/1
CACCCTGATCCAAGGTGGACCTCATATGCTATCCACCAATTCTTTCATTAATCTTTATCTGAGTTCCTATTACATATTTCAAAGAAACTGTCAAACAACT
+
CCCFFFFFHHHHHJJCGIIHIIJJJIJJJIIJJJJIJJ;FHIJIJJIGIGDIIJJJJJIJJIIFGIJIEIJJIJJIIIJJJHHFGHFFFFDFDEDEEDDD
@HS2000-1015_160:7:1108:13370:100570/1
GTGGGTCTTTGATCTTACAGGACAAATCCTTCAGAAAGATCAGGAAATCAAGGACCTAAAACAGAAGATAGCCGAAGTCATGGCCGTCATGCCCAGCATA
+
CCCFFFFFHHHFHGGIJIJDHEIIIGEHGGII>CFII<BG<DGCADDHBD>>BC?(8CGCCHIEC;@9@D47?BAD4,.(.6.;(3=8<A>@:<@AB9AA
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
@HS2000-1015_160:6:1306:14563:31825/1
TTTTAAACTTATGAACATCCAACAGCAACTAATGGTAAGTCAGAAAGCTTTTATTTTTAAGTATATGCATGTGTATGTGTGTGTTTTGTATGTGTGTGTC
+
CCCFFFFFHHGHHJEDHJIJJJJIIJJJJJJGGIICFHIDGHH@HGIIIGGHFHHG>GGDHEBFIJJJHGI@FCDEEEHHGHHEHHFFFEFFFEEEEEDD
@HS2000-1015_160:7:2308:13476:86835/2
ATGGTCCTAAGTAACGTGAGGCCCAGAGGGAGGAGACCTTCAGAGGAGGCCAGGATGGGGGTGATGGGGAACTTTTCCTGAAGCTGGAGTAGGGAAAGAT
+
@@CFDFFFHFHDHHGIGJGIJIJJEG@FHGCFGBF=DFGFHGG>GGGGHIHCHFCEHHFFD'8<ACDDDD38CCDDDDCDDACDCDC?B:>@C?>A?A>>
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/1
TTGGGAGGGACATATCTAGGGTGGACCAAGCGCTGGCCAGGACTCACCTGGGCTGAAGAGGTGAGCCCAGAGGAGCTGGTGGTCTTGAAGCAGCTCCGCC
+
CCCFFFFFHHHHHIJIIIJFGCHIIIJJJIJJIJIJJJJJJJJJJJJJJJGHHHFFDDFC=ACEDDDB=?BD@ABDBDDCCBDCCCCCCCCCDDCDDDD?
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
@HS2000-1015_160:5:1110:16577:55703/2
AGAAGCTGAAGTACGGTGTAGGAATGTTTCTACCACGCATTTTCCCTGCACCCTCTCACATCATTCATGGGCCATTTGTAGACAGATGGGAAAGAGATGT
+
CC@FFFFFHGDDHGIG:C<FBHEFHIGGGIIJIIIJIFJJJJIIHIJIHHIIDHIIIJJJJJJJIHHGEHHGEFFFCEDECDDEE>@CDDCCDBCDCCDC
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
@HS2000-1015_160:5:1216:14609:76938/2
TCAAAAGGCCTCTGGAGAGATTTCTGGGTGTGGGAGGTGCTATTTTGGGGCTTGGCCAGTCATATGGAGATAAGCCTACAAGGTTGGGGACCTGGCAGAT
+
CCCFFFFFFHHGHJJHHCCAFHIGHHJJ:E?FHHHDG?FGGIIJJIJIIJIJJIJHGCEECC@CBEFECEC@C@@CCDDDDDD>CCBDDD@??CBDC72:
@HS2000-1015_160:6:2209:18284:44195/1
GCAGCCAGCGGACGTCCAGGAACCGGGATGCCTCCAGCAGTGAGGCGGTCAGCCTGCAGCATGGGATGGCTGTGGATCTTTGGGGCAGCCCTGGGGCAGT
+
CCCFFFFFHHHHHIJHIIJJJJJJIJJEHIJJGIJJJJGHCHEHGFDDDDDCDDDDDDCDDCDDDCCBDCDDBCDDDDDDDDDBBDBDDDDDBDDDDBDC
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
Result I want:
@HS2000-1015_160:7:1108:13370:100570/1
GTGGGTCTTTGATCTTACAGGACAAATCCTTCAGAAAGATCAGGAAATCAAGGACCTAAAACAGAAGATAGCCGAAGTCATGGCCGTCATGCCCAGCATA
+
CCCFFFFFHHHFHGGIJIJDHEIIIGEHGGII>CFII<BG<DGCADDHBD>>BC?(8CGCCHIEC;@9@D47?BAD4,.(.6.;(3=8<A>@:<@AB9AA
@HS2000-1015_160:7:1108:13370:100570/2
CTTGACTGCCAGAGACGCTCCTTTGCAATGCCTTCCGGTAACCAAATTTTTGGGCACAACACACAGCTGGCCTTCATTTCTTCAGGGGCTGGTAAACAGA
+
@@@ADFFFHHHFD=EF@:GHIIFHH<ECHGF@DDBB:6@D60?F=888)8='--(=5@EAE5?'(..((.;?@>>A>3;@####################
+
@HS2000-1015_160:5:2306:10070:71746/1
AGAAGACGAGCACAGTGAGGAAGGAGATGAGCAGGCAGGGGATGATGAGGTTGATGGTGTAGAACAAGGGCAGGCGCCGGATGTACAGCGAGTATGTGAT
+
CCCFFFFFHHHHHIICGGHHHAFGEFGIJJJIIJIIGIIGG;FGHEHHGI=CHCEHGHE?DFFDFECEDDDDDDDDDBDDBD@ACDAACDBD5<CDC>@>
@HS2000-1015_160:5:2306:10070:71746/2
GAACCTCAAGGACTATTGGGAGAGCGGCGAGTGGGCCATCATCAAAGCCCCAGGCTACAAACACGACATCAAGTACAACTGCTGCGAGGAGATCTACCCC
+
@CCFFFFDHGHFHIJJJJJJGGGIIJJIGHI@FHGIIGHHEFGHHFFFFFBCDDDDDCDDDDDDDD;@BDCCDACDD@>ACCDDDDBDB<BA?C@CC@BD
@HS2000-1015_160:6:2116:4077:79041/1
TTGGGAGGGACATATCTAGGGTGGACCAAGCGCTGGCCAGGACTCACCTGGGCTGAAGAGGTGAGCCCAGAGGAGCTGGTGGTCTTGAAGCAGCTCCGCC
+
CCCFFFFFHHHHHIJIIIJFGCHIIIJJJIJJIJIJJJJJJJJJJJJJJJGHHHFFDDFC=ACEDDDB=?BD@ABDBDDCCBDCCCCCCCCCDDCDDDD?
@HS2000-1015_160:6:2116:4077:79041/2
GGTCCCCGCCTACGCCCACTGGGTTGGTGCACCTGGTGGTGGTGGCCGCCAAGAAGCTGGTGAACCGCCTCCAAGTGGCTCCCAAGACGCAGCTGGATGA
+
CCCFFFFFHHFHHJHJJJJJJJJGGHJHIGAGIIJIFHJ;@F;CHHFHFDDDDDCDDCDD9CCCDDBDDBBDDCDACDD8@BD3>?BCDBDDDACCDC@>
+
@HS2000-1015_160:5:2113:11446:94436/1
AATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGATCGCACCACTGCAGTCCAGCCTGGGAAACCAGAGCAAAACTTGGTCTCAAAAAAAAAA
+
CCCFFFFFHHHHGHGFGFHJIBF1DC<@;)9D*099?F;8=(BFF5(;'.=CEHFBDC7.>;?C@BBB35>(<?<<A>AC?B>4+(+>A:(:@(2)5@B#
@HS2000-1015_160:5:2113:11446:94436/2
CGTCAGGGCCAACCCCGCCCCACCCTGACCCTACCTGGCACCCCTCACCTGTGGCCTGCCAGCACAGCCTCGCCCCTGCTGGCCAATGTGTCCCCCGTCA
+
?@@DA@DDFHH?DHI)<@@FHDBGGCHCBDH;DFA<)6.=7D;@CBCHD)).7@=>;?==AABC95<(5(5309@D########################
+
@HS2000-1015_160:6:2209:18284:44195/1
GCAGCCAGCGGACGTCCAGGAACCGGGATGCCTCCAGCAGTGAGGCGGTCAGCCTGCAGCATGGGATGGCTGTGGATCTTTGGGGCAGCCCTGGGGCAGT
+
CCCFFFFFHHHHHIJHIIJJJJJJIJJEHIJJGIJJJJGHCHEHGFDDDDDCDDDDDDCDDCDDDCCBDCDDBCDDDDDDDDDBBDBDDDDDBDDDDBDC
@HS2000-1015_160:6:2209:18284:44195/2
TAAAATGTCACAAAGCTGGAAACTCTTCCCTATCACAAACCAAAACTTAAAAGGACGTTACCTGGCTGGGTCTAAACTCCACATAACTCGCTTGCAGTTG
+
CCCFFFFEHHHGHJIIIJJIJJHIIJEHJJHIJJJIIJJIJIJJIJIIHJJIJGGHGHGIIHHIIIIHFH@DFFFDEEEECDDDCDDDDBDDBBDCDACC
A bit long for a comment ... wondering if either of the following may be better suited for processing these largish data sets.
NOTE: I don't work with fastq
data and the following is based on ~15 minutes of googling, so fwiw ...
1 - Generate the fastq files in a fashion that allows for faster processing
The documentation for fastq-dump mentions that use of the --split-3
flag will dump the fastq data into 3x files ... a 'left' file, a 'right' file and a 'singleton' file.
Would the 'left' and 'right' files be referring to your /1
and /2
suffixes, respectively? If 'yes' then I'm wondering if you could generate the 3x files (and then discard the 'singleton' file)?
Would also be of interest to see if the data in the 'left' and 'right' files have their headers in the same order (in which case some sort of 'merge' join between the 2x files would likely be faster than manually pre-sorting and then matching).
Assuming the 3x files could be generated, this leads to the second idea ...
2 - Are there any tools designed for matching the left/right (/1 & /2
) datasets?
The brief introduction to fastq-pair seems to indicate that multiple files (like the 'left' and 'right' files generated by fastq-dump
- above ??) can be matched more quickly than many brute force solutions (eg, a referenced awk
solution which is a bit similar to my first answer).