Search code examples
rtreephylogeny

Phylogenetics in R: different results when working on a tree compared to reading it in


This question stems directly from a previous one I asked here:

Phylogenetics in R: collapsing descendant tips of an internal node

((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;

I have created this tree from a gene tree in R, using the midpoint function to root it. Now I apply the following function to it:

drop_dupes <- function(tree,thres=1e-05){
  tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
  toDrop <- tree$edge.length[tips] < thres
  newtree <- drop.tip(tree,tree$tip.label[toDrop])
  return(newtree)
}

Now my issue is that when I apply this function I do not get the tree I expected. The output when plotted looks like this:

enter image description here

However, if I write the tree out to a text file using write.tree and then read it in again as a Newick string, when I then apply the drop_dupes function above, I get a different tree, as shown in the below plot:

enter image description here

And this is what is confusing me. Why am I getting two different outputs when applying the same function to the very same tree, with the only difference being if it's read in or it's already a variable in R?

Edit: Here are some further details. The initial gene tree is as follows:

(B.weihenstephanensis.KBAB4.ffn:0.00136179,B.weihenstephanensisWSBC10204.ffn:0.00267375,(((B.cereus.NJW.ffn:0.00191517,(B.thuringiensis.HS181.ffn:0.00196859,(B.thuringiensis.Bt407.ffn:0.00000001,B.thuringiensis.chinensisCT43.ffn:0.00000001)0.879000:0.00205168)0.738000:0.00112995)0.969000:0.01796015,(B.cereus.FORC013.ffn:0.00000005,((B.thuringiensis.galleriaeHD29.ffn:0.00000001,(B.thuringiensis.kurstakiYBT1520.ffn:0.00000001,B.thuringiensis.YWC28.ffn:0.00000001)0.000000:0.00000001)0.971000:0.00403036,(B.cereus.ATCC14579.ffn:0.00201426,B.thuringiensis.tolworthi.ffn:0.00000005)0.000000:0.00000005)0.377000:0.00099978)0.969000:0.01717066)1.000000:0.04615485,(((B.cereus.FM1.ffn:0.00093161,B.cereus.FT9.ffn:0.00532243)0.990000:0.01252989,((B.cereus.AH187.ffn:0.00000001,B.cereus.NC7401.ffn:0.00000001)0.694000:0.00123243,(B.thuringiensis.finitimusYBT020.ffn:0.00272478,(B.cereus.ATCC10987.ffn:0.00085725,B.cereus.FRI35.ffn:0.00113572)0.994000:0.01307761)0.973000:0.00847373)0.972000:0.01103656)0.863000:0.00843782,((B.thuringiensis.9727.ffn:0.00202680,(B.cereus.03BB102.ffn:0.00099593,B.cereus.D17.ffn:0.00000001)0.741000:0.00099081)0.822000:0.00297097,(B.cereus.E33L.ffn:0.00200672,(B.cereus.S28.ffn:0.00000001,(B.thuringiensis.HD1011.ffn:0.00000001,(B.anthracis.Vollum1B.ffn:0.00000001,(B.anthracis.Turkey32.ffn:0.00000001,(B.anthracis.RA3.ffn:0.00099682,(B.anthracis.Pasteur.ffn:0.00000001,(B.anthracis.Larissa.ffn:0.00000001,(B.anthracis.Cvac02.ffn:0.00000001,(B.anthracis.CDC684.ffn:0.00000001,(B.anthracis.BFV.ffn:0.00000001,(B.anthracis.Ames.ffn:0.00000001,(B.anthracis.A16R.ffn:0.00000001,(B.anthracis.A0248.ffn:0.00000001,B.anthracis.A1144.ffn:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000005)0.000000:0.00000005)0.956000:0.00200056)0.000000:0.00000006)0.805000:0.00100953)0.809000:0.00210137)0.957000:0.01233888)0.929000:0.01906982)1.000000:0.05586907);

I read it into R as tree1. I then use the following code:

#Function to label nodes and tips as sequential integers
sort.names <- function(tr){
  tr$node.label<-(length(tr$tip.label) + 1):(length(tr$tip.label)+ tr$Nnode)
  ##some of these are tips, some are nodes, need to treat differently
  tr$tip.label<-1:(length(tr$tip.label))
  return(tr)
}

#Function to check if tree is rooted, and if it is not to use midpoint #rooting
rootCheck <- function(tree){
  if(is.rooted(tree) == FALSE){
    rootedTree <- midpoint(tree)
  }
  return(rootedTree)
}

#The above mentioned function to remove duplicate tips
drop_dupes <- function(tree,thres=1e-05){
  tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
  toDrop <- tree$edge.length[tips] < thres
  newtree <- drop.tip(tree,tree$tip.label[toDrop])
  return(newtree)
}

#Use functions on tree
a <- rootCheck(tree1)
b <- sort.names(a)
c <- di2multi(b, tol = 1e-05)
d <- drop_dupes(c)

Now at this point if I plot tree d, I will get the first plot above. However, if I write tree c to a text file, then read it back in and and then use the drop_dupes function on it, I will get the latter tree.

I have checked the newick file of tree c against the Newick tree at the top of the page and it is definitely the same.


Solution

  • The problem is in the sort.names function. It effectively rearranges the way the tree is written, and indexing then refers to other nodes than it should.

    If you need the tip.labels numbered, why not label them separately?

    tax.num <- data.frame(taxa = tree1$tip.label, 
        numbers = 1:(length(tree1$tip.label)))
    
    a <- midpoint(drop_dupes(tree1))
    a$tip.label <- tax.num$number[match(a$tip.label, tax.num$taxa)]
    
    plot(a)
    

    Also, di2multi seems to be redundant. It creates polytomies where the goal, as I understand it, is to discard tips with short branches.