I have two fastq files that I want to compare. The structure of the fastq files are as follows (per read):
The two files I am working with are:
File1 = Contains specific sequences of bases which I am interested in, along with their corresponding header lines (2 lines per read)
File2 = A normal fastq file for comparison (4 lines per read)
I need to compare file 1 and 2 to find all matching header lines, then save the corresponding sequence lines from file2.
Where file 1:
@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA
And file 2:
@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGN
+
#1=BDDFFHHHFHIJJJJJJJJJJJJJJJJJJJJJIJJIJJJJJHJIIJHGIJJJJJJIHJJBGHJHIIJJJHHHHFFFFEEEDD;?BACDDDA?@CDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACNNN
+
#1=DDDFDFHHHGHIIGJJJJHIJIHHDIHHIJGGEI@GFGHIHIJHEFHIIIIGIJGHHGECFGIDHGIHIIEGIIJHHEEFFF7?ACEECCBBDEDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1200:1972 1:N:0:GGCTAC
NTACGTTTAGTAGAGACAGTGTCTTGCTATGTTGCCCAGGCTGGTCTCAAACTCCTGAGCTCTAGCAAGCCTTCCACCTCTGCCTCCCAGTGTTCTGGGAT
+
#1=DDDDFHHHBHGIGIIJHCDHHIJJJHEGFIIHFHGEGHJEIFHHHEFHHGIGIJEHIIJJJJIJIJIJGIIH.?CEFFFFDCEDD3>>@CDDDDDD<@
In other words, for every matching header line, I want to save its partnered sequence line.
Finally, I want to write all these headers and sequences to a new file, for downstream analysis.
My current output is:
@HWI-D00461:137:C9H2FACXX:3:1101:1357:1984 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1755:2000 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1260:1977 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1917:1984 1:N:0:GGCTAC
My desired output would be:
@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA
Here is what I have so far.
ids = ''
with open(no_adapter_file, 'r') as file1:
with open(comparison_file, 'r') as file2:
common = set(file1).intersection(file2)
for line in common:
if line[0] == '@'
ids += line
with open(comparison_file, 'r') as file2:
ids_seq = ''
for line in file2:
if line == ids:
line += ids_seq
with open(new_file, 'w') as file_out:
for line in ids_seq:
file_out.write(line)
print(new_file + " was created.")
The code successfully pulls out all matching header lines, but I don't know how to pull out the sequence lines which follow.
EDIT: Added some examples of input and expected output.
There are several ways to tackle your problem. This is one that assumes nothing about the sorting of the files (which could be quite useful to avoid the potentially high memory consumption). It also assumes nothing about the possibility of having duplicate IDs (which could also reduce the memory footprint in case the reads are not guaranteed to be in the same order).
# First, get the IDs we are interested in
ids = set()
with open(no_adapter_file, 'r') as file1:
line_number = 0
for line in file1:
line_number += 1
if line_number % 2 == 1:
ids.add(line.rstrip())
# Now, read the second file and print desired info if the ID is
# one of the interesting ones
with open(comparison_file, 'r') as file2, open(new_file, 'w') as file_out:
line_number = 0
for line in file2:
line_number += 1
if line_number % 4 == 1:
candidate_id = line.rstrip()
elif line_number % 4 == 2:
candidate_seq = line.rstrip()
if candidate_id in ids:
file_out.write("{}\n{}\n".format(candidate_id, candidate_seq))
print(new_file + " was created.")
In this code, we first read the first file and store the IDs in a set. Then, this set is used to check if the reads coming from the second file must be written or not.