Search code examples
ralgorithmcomparisonphylogeny

Compare list of complex structures


I have two lists of complex structures (each list is a multiPhylo object containing phylogenetic trees), and I would like to find out how many times each element of the first one appears in the second one. Pretty straightforward, but for some reasons my code returns wrong results.

library(devtools)
install_github('santiagosnchez/rBt')
library(rBt)

beast_output <- read.annot.beast('strict_BD_starBEAST_logcomb.species.trees')
beast_output_rooted <- root(beast_output, c('taxon_A', 'taxon_B')) # length == 20,000
unique_multiphylo <- unique.multiPhylo(beast_output_rooted) # length == 130

count <- function(item, list) {
  total = 0
  for (i in 1:length(list)) {
    if (all.equal.phylo(item, list[[i]])) {
      total = total + 1
    }
  }
  return(total)
}

result <- data.frame(un_tr = rep(0, 130), count = rep(0, 130))
for (i in 1:length(unique_multiphylo)) {
  result[i, ] <- c(i, count(unique_multiphylo[[i]], beast_output_rooted))
}

The function all.equal.phylo() takes two phylo objects and returns TRUE if they are the same. See docs. The function count() takes an item and a list and returns the number of times the item appears in the list using all.equal.phylo().

The issue is that the function count() returns 0 most of the time. This should not be possible as the list unique_multiphylo is a sublist of beast_output_rooted, which means that count() should at least return 1.

What is wrong with my code? And how can I correct it? Many thanks for the help!

EDIT: here is a reproducible example:

install.packages('ape')
library(ape)

set.seed(42)
trees <- lapply(rep(c(10, 25, 50, 100), 3), rtree) # length == 12
class(trees) <- 'multiPhylo'
unique_multiphylo <- unique.multiPhylo(trees) # length == 12

count <- function(item, list) {
  total = 0
  for (i in 1:length(list)) {
    if (all.equal.phylo(item, list[[i]])) {
      total = total + 1
    }
  }
  return(total)
}

result <- data.frame(un_tr = rep(0, 12), count = rep(0, 12))
for (i in 1:length(unique_multiphylo)) {
  result[i, ] <- c(i, count(unique_multiphylo[[i]], trees))
}

However, it seems to be working perfectly fine with these simulated data...


Solution

  • There is a shorter solution to your problem:

    table( attr(unique_multiphylo, "old.index") )
    

    as unique_multiphylo contains an attribute with the information you are after (see ?unique.multiPhylo).