In a file, I have some characters to be substituted.
letters = ["B", "Z", "J", "U", "O"]
for record in SeqIO.parse(inFile, "fasta"):
for letter in letters:
if letters in str(record.seq):
print record.id
record.seq = str(record.seq).replace(letter, "X")
outFile.write(">%s\n%s\n" % (record.description, record.seq))
else:
outFile.write(">%s\n%s\n" % (record.description, record.seq))
#pass
The problem is that the output looks like this, writing the outputs as many characters I have in letters:
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
> >ID:WP_004160595.1|Erwinia_amylovora_01SFR-BO|01SFR-BO|50S_ribosomal_protei..|630|NZ_CAPA01000010(58437):26053-26682:-1
> MIGLVGKKVGMTRIFTEDGVSIPVTVIEIEANRVTQVKGLENDGYTAIQVTTGAKKANRVTKPAAGHFAKAGVEAGRGLWEFRTAEGAEFTVGQSINVDIFADVKKVDVTGTSKGKGFAGTVKRWNFRTQDATHGNSLSHRVPGSIGQNQTPGKVFKGKKMAGQLGNERVTVQSLDVVRVDAERNLLLVKGAVPGATGSDLIVKPAVKA
I think what you are trying to do is replace the ambiguous IUPAC amino acid codes (plus some additional letters that you have somehow gained?) with 'X'
.
Better to use str.translate()
(in Python 3) to do all your replacements at once. Also, since you're using Biopython to read the file, you can also write the output file easily with Biopython.
from Bio import SeqIO
from Bio.Seq import Seq
letters = ["B", "Z", "J", "U", "O"]
trans_tab = str.maketrans(''.join(letters), 'X'*len(letters))
def yield_seqs(in_file):
for record in SeqIO.parse(in_file, 'fasta'):
record.seq = Seq(str(record.seq).translate(trans_tab))
yield record
SeqIO.write(yield_seqs('input.fasta'), 'output.fasta', 'fasta')
Example:
$ cat input.fasta
>1
MBZJ
$ python3 myscript.py
$ cat output.fasta
>1
MXXX