I have a fasta file. From that file, I need to get the only sequences containing 'CCNNNGG'
(where 'N'
represents random nucleotides) and put them in a new fasta file.
Example (it should output the first sequence):
m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3312_3597 CGCGGCATCGAATTAATACGACTCACTATAGGTTTTTTTATTG*********CCTACGG***********GTATTTTCAGTTAGATTCTTTCTTCTTAGAGGGTACAGAGAAAGGGAGAAAATAGCTACAGACATGGGAGTGAAAGGTAGGAAGAAGAGCGAAGCAGACATTATTCA
m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3708_4657 CAACGGTTTTGCCACAAGATCAGGAACATAAGTCACCAGACTCAATTCATCCCCATAAGACCTCGGACCTCTCAATCCTCGAATTAGGATGTTCTCGTACGGTCTATCAGTATATAAACCTGACATACTATAAAAAAGTATACCAT TCTTATCATGTACAGTAGGGTACAGTAGG
(*
s added for highlighting)
And my code:
from Bio import SeqIO
my_sequences = []
for record in SeqIO.parse(open("example.fa", "rU"), "fasta") :
if "CCTACGG" in record.seq : #Works fine with CCTACGG
my_sequences.append(record)
output_handle = open("my_seqs.fasta", "w")
SeqIO.write(my_sequences, output_handle, "fasta")
output_handle.close()
My problem is that I don't know how to write random nucleotides, instead of write "CCTACGG"
after if
I want to put 'CCNNNGG'
, where N
represents random nucleotides ('C'
or 'T'
or 'G'
or 'A'
).
You can use regular expressions to do this, via Python's re
module:
import re
pattern = 'CCNNNGG'
regex = re.compile(pattern.replace('N', '[ACGT]'))
for record in SeqIO.parse(...):
if re.search(regex, record.seq) is not None:
my_sequences.append(record)
This replaces every 'N'
in your pattern with '[ACGT]'
, which will match any one of those four characters, then searches for that pattern in each record.seq
.
Also, note that your examples aren't very good - the second one also matches that pattern (it contains 'CCCATGG'
) - see the results!