This question stems directly from a previous one I asked here:
Phylogenetics in R: collapsing descendant tips of an internal node
((((11:0.00201426,12:5e-08,(9:1e-08,10:1e-08,8:1e-08)40:0.00403036)41:0.00099978,7:5e-08)42:0.01717066,(3:0.00191517,(4:0.00196859,(5:1e-08,6:1e-08)71:0.00205168)70:0.00112995)69:0.01796015)43:0.042592645,((1:0.00136179,2:0.00267375)44:0.05586907,(((13:0.00093161,14:0.00532243)47:0.01252989,((15:1e-08,16:1e-08)49:0.00123243,(17:0.00272478,(18:0.00085725,19:0.00113572)51:0.01307761)50:0.00847373)48:0.01103656)46:0.00843782,((20:0.0020268,(21:0.00099593,22:1e-08)54:0.00099081)53:0.00297097,(23:0.00200672,(25:1e-08,(36:1e-08,37:1e-08,35:1e-08,34:1e-08,33:1e-08,32:1e-08,31:1e-08,30:1e-08,29:1e-08,28:0.00099682,27:1e-08,26:1e-08)58:0.00200056,24:1e-08)56:0.00100953)55:0.00210137)52:0.01233888)45:0.01906982)73:0.003562205)38;
I have created this tree from a gene tree in R, using the midpoint
function to root it. Now I apply the following function to it:
drop_dupes <- function(tree,thres=1e-05){
tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
toDrop <- tree$edge.length[tips] < thres
newtree <- drop.tip(tree,tree$tip.label[toDrop])
return(newtree)
}
Now my issue is that when I apply this function I do not get the tree I expected. The output when plotted looks like this:
However, if I write the tree out to a text file using write.tree
and then read it in again as a Newick string, when I then apply the drop_dupes
function above, I get a different tree, as shown in the below plot:
And this is what is confusing me. Why am I getting two different outputs when applying the same function to the very same tree, with the only difference being if it's read in or it's already a variable in R?
Edit: Here are some further details. The initial gene tree is as follows:
(B.weihenstephanensis.KBAB4.ffn:0.00136179,B.weihenstephanensisWSBC10204.ffn:0.00267375,(((B.cereus.NJW.ffn:0.00191517,(B.thuringiensis.HS181.ffn:0.00196859,(B.thuringiensis.Bt407.ffn:0.00000001,B.thuringiensis.chinensisCT43.ffn:0.00000001)0.879000:0.00205168)0.738000:0.00112995)0.969000:0.01796015,(B.cereus.FORC013.ffn:0.00000005,((B.thuringiensis.galleriaeHD29.ffn:0.00000001,(B.thuringiensis.kurstakiYBT1520.ffn:0.00000001,B.thuringiensis.YWC28.ffn:0.00000001)0.000000:0.00000001)0.971000:0.00403036,(B.cereus.ATCC14579.ffn:0.00201426,B.thuringiensis.tolworthi.ffn:0.00000005)0.000000:0.00000005)0.377000:0.00099978)0.969000:0.01717066)1.000000:0.04615485,(((B.cereus.FM1.ffn:0.00093161,B.cereus.FT9.ffn:0.00532243)0.990000:0.01252989,((B.cereus.AH187.ffn:0.00000001,B.cereus.NC7401.ffn:0.00000001)0.694000:0.00123243,(B.thuringiensis.finitimusYBT020.ffn:0.00272478,(B.cereus.ATCC10987.ffn:0.00085725,B.cereus.FRI35.ffn:0.00113572)0.994000:0.01307761)0.973000:0.00847373)0.972000:0.01103656)0.863000:0.00843782,((B.thuringiensis.9727.ffn:0.00202680,(B.cereus.03BB102.ffn:0.00099593,B.cereus.D17.ffn:0.00000001)0.741000:0.00099081)0.822000:0.00297097,(B.cereus.E33L.ffn:0.00200672,(B.cereus.S28.ffn:0.00000001,(B.thuringiensis.HD1011.ffn:0.00000001,(B.anthracis.Vollum1B.ffn:0.00000001,(B.anthracis.Turkey32.ffn:0.00000001,(B.anthracis.RA3.ffn:0.00099682,(B.anthracis.Pasteur.ffn:0.00000001,(B.anthracis.Larissa.ffn:0.00000001,(B.anthracis.Cvac02.ffn:0.00000001,(B.anthracis.CDC684.ffn:0.00000001,(B.anthracis.BFV.ffn:0.00000001,(B.anthracis.Ames.ffn:0.00000001,(B.anthracis.A16R.ffn:0.00000001,(B.anthracis.A0248.ffn:0.00000001,B.anthracis.A1144.ffn:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000001)0.000000:0.00000005)0.000000:0.00000005)0.956000:0.00200056)0.000000:0.00000006)0.805000:0.00100953)0.809000:0.00210137)0.957000:0.01233888)0.929000:0.01906982)1.000000:0.05586907);
I read it into R as tree1. I then use the following code:
#Function to label nodes and tips as sequential integers
sort.names <- function(tr){
tr$node.label<-(length(tr$tip.label) + 1):(length(tr$tip.label)+ tr$Nnode)
##some of these are tips, some are nodes, need to treat differently
tr$tip.label<-1:(length(tr$tip.label))
return(tr)
}
#Function to check if tree is rooted, and if it is not to use midpoint #rooting
rootCheck <- function(tree){
if(is.rooted(tree) == FALSE){
rootedTree <- midpoint(tree)
}
return(rootedTree)
}
#The above mentioned function to remove duplicate tips
drop_dupes <- function(tree,thres=1e-05){
tips <- which(tree$edge[,2] %in% 1:Ntip(tree))
toDrop <- tree$edge.length[tips] < thres
newtree <- drop.tip(tree,tree$tip.label[toDrop])
return(newtree)
}
#Use functions on tree
a <- rootCheck(tree1)
b <- sort.names(a)
c <- di2multi(b, tol = 1e-05)
d <- drop_dupes(c)
Now at this point if I plot tree d
, I will get the first plot above. However, if I write tree c
to a text file, then read it back in and and then use the drop_dupes
function on it, I will get the latter tree.
I have checked the newick file of tree c
against the Newick tree at the top of the page and it is definitely the same.
The problem is in the sort.names
function. It effectively rearranges the way the tree is written, and indexing then refers to other nodes than it should.
If you need the tip.labels numbered, why not label them separately?
tax.num <- data.frame(taxa = tree1$tip.label,
numbers = 1:(length(tree1$tip.label)))
a <- midpoint(drop_dupes(tree1))
a$tip.label <- tax.num$number[match(a$tip.label, tax.num$taxa)]
plot(a)
Also, di2multi
seems to be redundant. It creates polytomies where the goal, as I understand it, is to discard tips with short branches.