
Using a function on a column from tree file class Phylo


I have a phylogenetic tree with many tips and internal nodes, and a separate table containing a list of node ids from that tree. I want to add a new column, children, to the table. To get the descendants of a node (internal nodes and tips), I am using phangorn::Descendants(tree, NODEID, type = 'all'). Wrapping the call in length() gives the number of descendants. For example:

phangorn::Descendants(tree, 12514, type = 'all')
[1] 12515 12517 12516  5345  5346  5347  5343  5344

length(phangorn::Descendants(tree, 12514, type = 'all'))
[1] 8

I would like to take the 'nodes' column of my data frame and apply length(phangorn::Descendants(tree, NODEID, type = 'all')) to each value, creating a new column that holds the number of descendants of each input node.

Here is an example:

tests <- data.frame(nodes=c(12551, 12514, 12519))
length(phangorn::Descendants(tree, 12519, type = 'all'))
[1] 2
length(phangorn::Descendants(tree, 12514, type = 'all'))
[1] 8
length(phangorn::Descendants(tree, 12551, type = 'all'))
[1] 2
tests$children <- length(phangorn::Descendants(tree, tests$nodes, type = 'all'))
tests
  nodes children
1 12551        3
2 12514        3
3 12519        3

As shown above, every row gets 3, which is the number of rows in the data.frame, not the number of children calculated above. It should be:

tests
  nodes children
1 12551        2
2 12514        8
3 12519        2

If you have any tips or ideas on how to make this behave as expected, that would be great. I have a feeling I have to use apply(), or index into the result before calling length(). Thank you in advance.


Solution

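  • You're very close. When you pass the whole column, phangorn::Descendants returns a list with one element per node, so length() gives the number of nodes (3) rather than the number of descendants of each node. Here is one quick solution using sapply. There are other alternatives, but this one follows the structure of your question.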

    Generating some data

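    library(ape)  # provides rtree(); phangorn must also be installed for phangorn::Descendants()

    ntips <- 10
    tree <- rtree(ntips)  # random tree with ntips tips
    # internal nodes are numbered from ntips + 1 to ntips + tree$Nnode
    targetNodes <- data.frame(nodes = seq(ntips + 1, ntips + tree$Nnode))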
    

    Note that I'm storing all the relevant nodes in the targetNodes object. It plays the same role as the following object in your question:

    tests <- data.frame(nodes=c(12551, 12514, 12519))
    

    Using sapply

    Now, let's use sapply to repeat the same operation across all the relevant nodes in targetNodes:

    targetNodes$children <- sapply(targetNodes$nodes, function(x) {
      length(phangorn::Descendants(tree, x, type = 'all'))
    })
    

    sapply returns one count per node, and I'm saving that output as a new children column in targetNodes.
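
    As a rough sketch of two of the other alternatives: lengths() (with an "s") counts the elements of each list entry in one call, and vapply() does the same job as sapply() while declaring the return type. This assumes phangorn::Descendants returns a list with one entry per node when given several nodes, which matches the 3 you saw. The seed and the node ids 11 to 13 below are made up for illustration and are not from your tree.

    ```r
    library(ape)  # rtree(); phangorn is called via phangorn::

    set.seed(1)                                 # arbitrary seed, only for reproducibility
    tree <- rtree(10)                           # random tree with 10 tips
    tests <- data.frame(nodes = c(11, 12, 13))  # hypothetical internal node ids

    # Passing the whole column returns a list with one entry per node,
    # so length() reports how many nodes were queried.
    desc <- phangorn::Descendants(tree, tests$nodes, type = "all")
    length(desc)

    # lengths() counts the elements of each entry, i.e. descendants per node.
    tests$children <- lengths(desc)

    # vapply() variant of the sapply() call, with a declared integer result.
    tests$children_vapply <- vapply(
      tests$nodes,
      function(x) length(phangorn::Descendants(tree, x, type = "all")),
      integer(1)
    )
    ```

    Both columns should hold the same counts. lengths() avoids calling Descendants once per node, which may matter on a large tree.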

    Good luck!