I have 2 files one is fasta
file and other one is fastq
file. I want to take the fasta
, read each line and search each line in the fastq
file and print top line and bottom lines. This is what I have
fasta file
read1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG
for seq in `cat sequences`;do grep -A 2 -B 1 $seq FAP.1.txt;done
@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+ ;@?DBD<@@FFHHH<
@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########
From this we can see that AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
appeared twice, but I want to print only once. how can I do that?
Final output file
@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
;@?DBD<@@FFHHH<
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B
Unless you have a strong reason to do this yourself, use Biopython.
fasta:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG
fastq (based on yours but not identical because your output was badly formatted):
@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########
Code:
from Bio import SeqIO
with open("fasta") as fh:
fasta = fh.read().splitlines()
seen = set()
for record in SeqIO.parse(open('fastq'), 'fastq'):
seq = str(record.seq)
if seq in fasta and seq not in seen:
seen.add(seq)
print record.format('fastq')
EDIT: The above prints records in the order of the fastq file, not the fasta file. If order is not important, you should use that method. Otherwise, you can add the records to a dictionary where the key is their index in the FASTA file, and print them all in the end, sorting the dictionary:
from Bio import SeqIO
import sys
with open("fasta") as fh:
fasta = fh.read().splitlines()
seen = set()
records = {}
for record in SeqIO.parse(open('fastq'), 'fastq'):
seq = str(record.seq)
if seq in fasta and seq not in seen:
seen.add(seq)
records[fasta.index(seq)] = record
for record in sorted(records):
sys.stdout.write(records[record].format('fastq'))
(Here I also use sys.stdout.write
instead of print
, to avoid the extra newlines.)