How does Biopython determine the root of a phylogenetic tree?


Other packages, particularly ape for R, build an unrooted tree and then let you root it by explicitly specifying an outgroup.

In contrast, Biopython lets me create a rooted tree directly, without specifying the root, so I'm wondering how the root is determined, for example in the following code:

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Load the multiple sequence alignment
alignment = AlignIO.read('mulscle-msa-aligned-105628a58654.fasta', 'fasta')

# Compute pairwise distances under the identity model
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

# Build the tree with UPGMA
constructor = DistanceTreeConstructor()
tree = constructor.upgma(dm)

# Write the tree in phyloXML format
Phylo.write(tree, 'phyloxml-7016bed7d42.xml', 'phyloxml')

I made up the sequences here after the tree was built, but nonetheless this is a rooted tree built from that process.

[Image: the rooted tree produced by the code above]


Solution

  • As @cel said, this is a product of the UPGMA algorithm. UPGMA creates a tree by working backward from the present (or whenever your data are from). It starts by finding the two most similar species. In theory, these species have a more recent common ancestor than any other pair of species, so they are grouped together. The similarity of that common ancestor to each other species in the tree is then loosely estimated by averaging that species' similarity to all members of the group.

    This process continues, grouping the two most similar species (or presumed common ancestors) in the tree at each step and then recalculating similarities, until there are only two groups left. One of these groups may have only one member, in which case it can effectively be thought of as the outgroup, but they may also both have many members. The root of the tree is the common ancestor of these two groups.
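To make the merging order concrete, here is a minimal pure-Python sketch of the clustering described above, run on a made-up four-taxon distance matrix. It is not Biopython's implementation: the function name `upgma_topology`, the nested-tuple tree representation, and the distances are all assumptions for illustration, and branch lengths are left out. It works with distances rather than similarities, so "most similar" becomes "smallest distance".

```python
from itertools import combinations


def upgma_topology(names, dist):
    """Return the UPGMA topology as nested tuples (illustrative sketch only).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)  # copy so the caller's matrix is left untouched
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (ties: first pair found)
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average, i.e. the mean over all leaf pairs
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    # The single remaining cluster is the root: the join of the last two groups
    (root,) = clusters
    return root


# Hypothetical distances: A and B are closest, D is far from everything
example = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
    frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
}
root = upgma_topology(["A", "B", "C", "D"], example)
print(root)  # ('D', ('C', ('A', 'B')))
```

In this example the last merge joins the single taxon D with the group containing A, B and C, so D ends up playing the role of an outgroup even though none was specified. With a different matrix, the last two groups could each contain several taxa.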