skbio

TabularMSA replacement for Alignment (scikit-bio 0.4.1.dev0)


I would like to read an alignment in FASTA format, update the sequence labels, and write the results to a file in PHYLIP format. How can I edit the following lines to use TabularMSA in scikit-bio 0.4.1.dev0 (instead of Alignment, which was supported earlier)?

    from skbio import Alignment
    ...
    msa_fa = Alignment.read(gene_msa_fa_fp, format='fasta')
    msa_fa_update_ids, new_to_old_ids = msa_fa.update_ids(func=id_mapper)
    msa_fa_update_ids.write(output_msa_phy_fp, format='phylip')
    ...

Thanks!


Solution

  • When reading a FASTA file into a TabularMSA object, sequence identifiers are stored in each sequence's metadata dictionary under the key "id". When writing a TabularMSA object in PHYLIP format, the MSA's index property is used to label the sequences. Call reassign_index to set the FASTA sequence identifiers as the MSA's index, then map those to the sequence labels you want written, and finally write the MSA out in PHYLIP format:

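    from skbio import TabularMSA, DNA
    # Read the FASTA alignment; each sequence's ID is stored in its metadata under "id".
    msa = TabularMSA.read("aln.fasta", constructor=DNA)
    # Use the FASTA IDs as the MSA's index.
    msa.reassign_index(minter='id')
    # Map the current index labels to the new labels (id_mapper is the one from the question).
    msa.reassign_index(mapping=id_mapper)
    # The PHYLIP writer labels each sequence with its index label.
    msa.write('aln.phy', format='phylip')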
    

    There are a variety of ways to set the index, including setting the property directly or using reassign_index with either mapping or minter parameters.
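
The question's snippet also captured a reverse mapping (new_to_old_ids) from update_ids, which the reassign_index calls above do not return. A minimal sketch of building both mappings yourself in plain Python follows. The helper name make_id_mapper, the "s" prefix, and the example IDs are hypothetical and not part of scikit-bio, and the sketch assumes the original IDs are unique. The resulting id_mapper dict is the kind of object that could be passed as mapping=id_mapper:

```python
def make_id_mapper(old_ids, prefix="s"):
    """Build forward and reverse label mappings (hypothetical helper).

    Assumes old_ids contains no duplicates. New labels are short
    sequential names such as "s1", "s2", ...
    """
    old_to_new = {}
    for i, old_id in enumerate(old_ids, start=1):
        old_to_new[old_id] = f"{prefix}{i}"
    # Invert the mapping so the original IDs can be recovered later.
    new_to_old = {new: old for old, new in old_to_new.items()}
    return old_to_new, new_to_old


# Made-up example IDs standing in for the labels read from the FASTA file.
old_ids = ["gene_A|Homo_sapiens", "gene_A|Mus_musculus", "gene_A|Danio_rerio"]
id_mapper, new_to_old_ids = make_id_mapper(old_ids)
```

Short sequential labels are chosen here because strict PHYLIP limits labels to 10 characters; keeping new_to_old_ids around lets you restore the original names after downstream analysis.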