I have a large file with lots of FASTA sequences in it. Some of them need to be renamed -- I am trying to replace FASTA sequence IDs with an updated version of them. I stored the information in a dictionary such that the old ID is the key with the new ID as the value. No matter what I do, I can't seem to either replace the IDs or write a new fasta file properly. I'm using SeqIO to read in my fasta file. Here is some of my code:
This produces a shallow replacement of the record IDs in that they print here accurately, but are not actually changed in the file itself:
    rename_fastas = {'446_was_445_cDNA_v01VT': '446_cDNA_v01VT', '446_was_445_cDNA_v03VT': '446_cDNA_v03VT',
                     '428PBMC_2_V3': '428_PBMC_2_V3', '428PBMC_3_V3': '428_PBMC_3_V3', '428PBMC_4_V3': '428_PBMC_4_V3', '428PBMC_5_V3': '428_PBMC_5_V3'}
    with open('fasta.fsa') as f:
        for seq_record in SeqIO.parse(f, 'fasta'):
            for k, v in rename_fastas.items():
                if seq_record.id == k:
                    seq_record.id = seq_record.description = seq_record.id.replace(k, v)
                    print(seq_record.id)
This gave me way too many entries in my new file:
    with open('fasta.fsa') as original, open('fasta2.fsa', 'w') as corrected:
        records = SeqIO.parse(original, 'fasta')
        for record in records:
            for k, v in rename_fastas.items():
                if record.id == k:
                    record.id = record.description.replace(k, v)
                else:
                    record.id = record.id
                SeqIO.write(record, corrected, 'fasta')
This also did not work, and I'm not sure why:
    with open('fasta.fsa') as f:
        for seq_record in SeqIO.parse(f, 'fasta'):
            seq_record.id = seq_record.description = seq_record.id.replace('428PBMC', '428_PBMC')
            seq_record.id = seq_record.description = seq_record.id.replace('446_was_445', '446')
            yield seq_record
Any help would be appreciated!
Try this:

    from Bio import SeqIO

    rename_fastas = {'446_was_445_cDNA_v01VT': '446_cDNA_v01VT', '446_was_445_cDNA_v03VT': '446_cDNA_v03VT', '428PBMC_2_V3': '428_PBMC_2_V3', '428PBMC_3_V3': '428_PBMC_3_V3', '428PBMC_4_V3': '428_PBMC_4_V3', '428PBMC_5_V3': '428_PBMC_5_V3'}
    with open('fasta.fsa') as original, open('fasta2.fsa', 'w') as corrected:
        for seq_record in SeqIO.parse(original, 'fasta'):
            # Rename only the records whose ID is a key in the dict.
            if seq_record.id in rename_fastas:
                seq_record.id = seq_record.description = rename_fastas[seq_record.id]
            # Write every record exactly once, renamed or not.
            SeqIO.write(seq_record, corrected, 'fasta')

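Both files are opened together, one for input and one for output. Your dict already maps each old ID to its new ID, so there is no need to loop over it for every record: check membership with the in operator and look the new ID up by key. If the record's ID is in the dict, replace both the ID and the description with the dict's value. Finally, write each record to the output file once, whether or not it was renamed. A write that sits inside a loop over the dict runs once per dict entry, which would explain the extra entries in your output.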
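To show the dict-lookup idea on its own, here is a minimal sketch that applies it to plain FASTA header lines without Biopython. The function name rename_fasta_headers and the sample data are made up for illustration. It assumes the ID is the first whitespace-delimited token after '>', and, like the code above, it replaces the whole header with the new ID.

```python
def rename_fasta_headers(lines, mapping):
    """Yield FASTA lines, swapping IDs found in mapping (old ID -> new ID)."""
    for line in lines:
        if line.startswith('>'):
            # Assumption: the ID is the first whitespace-delimited token after '>'.
            fields = line[1:].split(None, 1)
            old_id = fields[0] if fields else ''
            if old_id in mapping:
                # One dict lookup per header; the whole header becomes the new ID.
                line = '>' + mapping[old_id] + '\n'
        # Every line is yielded exactly once, renamed or not.
        yield line


# Hypothetical sample data for illustration only.
rename_fastas = {'428PBMC_2_V3': '428_PBMC_2_V3'}
sample = ['>428PBMC_2_V3\n', 'ACGT\n', '>other_seq\n', 'TTGA\n']
renamed = list(rename_fasta_headers(sample, rename_fastas))
```

For a real file, the same generator could be fed an open file handle, with its output passed to writelines on the output handle. The Biopython version above is still the safer choice when records need proper parsing.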