I would like to read an alignment in FASTA format, update the sequence labels, and write the result back to a file in PHYLIP format. How can I edit the following lines to use TabularMSA in scikit-bio 0.4.1.dev0 (instead of Alignment, which was supported earlier)?
from skbio import Alignment
...
msa_fa = Alignment.read(gene_msa_fa_fp, format='fasta')
msa_fa_update_ids, new_to_old_ids = msa_fa.update_ids(func=id_mapper)
msa_fa_update_ids.write(output_msa_phy_fp, format='phylip')
...
Thanks!
When reading a FASTA file into a TabularMSA object, sequence identifiers are stored in each sequence's metadata dictionary under the key "id". When writing a TabularMSA object in PHYLIP format, the MSA's index property is used to label the sequences. Use reassign_index to set the FASTA sequence identifiers as the MSA's index, then map those to the sequence labels you want written, and finally write out in PHYLIP format:
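from skbio import TabularMSA, DNA
msa = TabularMSA.read("aln.fasta", constructor=DNA)
# Use each sequence's metadata "id" (the FASTA identifier) as the index
msa.reassign_index(minter='id')
# Map the current index labels to the labels you want (id_mapper from the question)
msa.reassign_index(mapping=id_mapper)
# The PHYLIP writer labels each sequence with the MSA's index
msa.write('aln.phy', format='phylip')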
There are a variety of ways to set the index, including setting the property directly or using reassign_index with either the mapping or minter parameter.
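The update_ids call in the question also returned a new-to-old ID lookup, which, as far as I can tell, reassign_index does not give you back. Below is a minimal plain-Python sketch of building both lookups yourself. The helper name build_id_maps, the "seq" prefix, and the example identifiers are all hypothetical, and the 10-character limit assumes strict PHYLIP labels:

```python
def build_id_maps(old_ids, prefix="seq", max_len=10):
    """Return (old_to_new, new_to_old) dicts with short, unique labels.

    Hypothetical helper: labels are prefix + running number, and max_len
    reflects the assumed 10-character limit of strict PHYLIP labels.
    """
    old_to_new = {}
    new_to_old = {}
    for i, old_id in enumerate(old_ids, start=1):
        if old_id in old_to_new:
            raise ValueError(f"duplicate sequence ID: {old_id!r}")
        new_id = f"{prefix}{i}"
        if len(new_id) > max_len:
            raise ValueError(f"label {new_id!r} is longer than {max_len} characters")
        old_to_new[old_id] = new_id
        new_to_old[new_id] = old_id
    return old_to_new, new_to_old


# Made-up FASTA identifiers, for illustration only
old_ids = ["gi|12345|ref|NM_000001.1|", "gi|67890|ref|NM_000002.1|"]
old_to_new, new_to_old = build_id_maps(old_ids)
```

After msa.reassign_index(minter='id'), you could pass list(msa.index) as old_ids and then call msa.reassign_index(mapping=old_to_new), keeping new_to_old around to translate the PHYLIP labels back to the original FASTA identifiers later.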