Search code examples
rbranchtopologyphylogenyape

Obtaining mean phylogenetic tree branch lengths from an ensemble of phylogenetic trees


I have a set of phylogenetic trees, some with different topologies and different branch lengths. Here and example set:

(LA:97.592181158,((HS:82.6284812237,RN:72.190055848635):10.438414999999999):3.989335,((CP:32.2668593286,CL:32.266858085):39.9232054349,(CS:78.2389673073,BT:78.238955218815):8.378847):10.974376);

(((HS:71.9309734249,((CP:30.289472339999996,CL:30.289473923):31.8509454,RN:62.1404181356):9.790551):2.049235,(CS:62.74606492390001,BS:62.74606028250001):11.234141000000001):5.067314,LA:79.0475136246);

(((((CP:39.415718961379994,CL:39.4157161214):29.043224136600003,RN:68.4589436016):8.947169,HS:77.4061105636):4.509818,(BS:63.09170355585999,CS:63.09171066541):18.824224):13.975551000000001,LA:95.891473546);

(LA:95.630761929,((HS:73.4928857457,((CP:32.673882875400004,CL:32.673881941):33.703323212,RN:66.37720021233):7.115682):5.537861,(CS:61.798048265700004,BS:61.798043931600006):17.232697):16.600025000000002);

(((HS:72.6356569413,((CP:34.015223002300004,CL:34.015223157499996):35.207698155399996,RN:69.2229294656):3.412726):8.746038,(CS:68.62665546391,BS:68.6266424085):12.755043999999998):13.40646,LA:94.78814570300001);

(LA:89.58710099299999,((HS:72.440439124,((CP:32.270428384199995,CL:32.2704269484):32.0556597315,RN:64.32607145395):8.114349):6.962274,(CS:66.3266360702,BS:66.3266352709):13.076080999999999):10.184418);

(LA:91.116083247,((HS:73.8383213643,((CP:36.4068361936,CL:36.4068400719):32.297183626700004,RN:68.704029984267):5.134297):6.50389,(BS:68.6124876659,CS:68.61249734691):11.729719):10.773886000000001);

(((HS:91.025288418,((CP:40.288406529099994,CL:40.288401832999995):29.854198951399997,RN:70.14260821095):20.882673999999998):6.163698,(CS:81.12951949976,BS:81.12952162629999):16.059462):13.109915,LA:110.298870881);

In this example there are 2 unique topologies - using R's ape unique.multiPhylo shows that (assuming the example above is saved to a file tree.fn):

tree <- ape::read.tree(tree.fn)
unique.tree <- ape::unique.multiPhylo(tree, use.tip.label = F, use.edge.length = F)

> length(tree)
[1] 8
> length(unique.tree)
[1] 2

My question is how do I get a list of trees, each one representing a unique topology in the input list, and the branch lengths are a summary statistic, such as mean or median, across all trees with the same topology.

In the example above, it will return the first tree as is, because its topology is unique, and another tree which is the topology of the other trees, with mean or median branch lengths?


Solution

  • If I understand well, you want to sort all the trees for each unique into different groups (e.g. in your example, the first group contains one tree, etc...) and then measure some stats for each group?

    You can do that by first grouping the topologies into a list:

    set.seed(5)
    ## Generating 20 4 tip trees (hopefully they will be identical topologies!)
    tree_list <- rmtree(20, 4)
    ## How many unique topologies?
    length(unique(tree_list))
    ## Sorting the trees by topologies
    tree_list_tmp <- tree_list
    sorted_tree_list <- list()
    counter <- 0
    while(length(tree_list_tmp) != 0) {
        counter <- counter+1
        ## Is the first tree equal to any of the trees in the list
        equal_to_tree_one <- unlist(lapply(tree_list_tmp, function(x, base) all.equal(x, base, use.edge.length = FALSE), base = tree_list_tmp[[1]]))
        ## Saving the identical trees
        sorted_tree_list[[counter]] <- tree_list_tmp[which(equal_to_tree_one)]
        ## Removing them from the list
        tree_list_tmp <- tree_list_tmp[-which(equal_to_tree_one)]
        ## Repeat while there are still some trees!
    }
    ## The list of topologies should be equal to the number of unique trees
    length(sorted_tree_list) == length(unique(tree_list))
    ## Giving them names for fancyness
    names(sorted_tree_list) <- paste0("topology", 1:length(sorted_tree_list))
    

    Then for all the trees in each unique topology group you can extract different summary statistics by making a function. Here for example I will measure the branch length sd, mean and 90% quantiles.

    ## function for getting some stats
    get.statistics <- function(unique_topology_group) {
        ## Extract the branch lengths of all the trees
        branch_lengths <- unlist(lapply(unique_topology_group, function(x) x$edge.length))
    
        ## Apply some statistics
        return(c(   n = length(unique_topology_group),
                 mean = mean(branch_lengths),
                   sd = sd(branch_lengths),
                   quantile(branch_lengths, prob = c(0.05, 0.95))))
    } 
    
    ## Getting all the stats
    all_stats <- lapply(sorted_tree_list, get.statistics)
    ## and making it into a nice table
    round(do.call(rbind, all_stats), digits = 3)
    #            n  mean    sd    5%   95%
    # topology1  3 0.559 0.315 0.113 0.962
    # topology2  2 0.556 0.259 0.201 0.889
    # topology3  4 0.525 0.378 0.033 0.989
    # topology4  2 0.489 0.291 0.049 0.855
    # topology5  2 0.549 0.291 0.062 0.882
    # topology6  1 0.731 0.211 0.443 0.926
    # topology7  3 0.432 0.224 0.091 0.789
    # topology8  1 0.577 0.329 0.115 0.890
    # topology9  1 0.473 0.351 0.108 0.833
    # topology10 1 0.439 0.307 0.060 0.795
    

    Of course you can tweak it to get your own desired stats or even get the stats per trees per groups (using a double lapply lapply(sorted_trees_list, lapply, get.statistics) or something like that).