Search code examples
rtreedendrogramape-phylo

Trying to write data into newick format R


I have a dataset with different levels of branches starting from the top level: stock -> mbranch -> sbranch -> lsbranch. I want to be able to visualize these levels of my data into Newick format. I have different language groups within each stock level and would like to make different trees based off of these highest level groups.

For example my data is in the format as follows:

sample= data.frame("stock" = c("A", "A", "B", "B", "B"), "mbranch" = c("C", "D", "E", "F", "G"), "sbranch" = c("H", "O", NA, "K", "L"), "lsbranch" = c("I", "J", NA, "M", "N"), "name" = c("Andrea", "Kevin", "Charlie", "Naomi", "Sam"))

And I am trying to have an output of the newick tree format which would be something like:

tree = "(A(C(H(I(Andrew))),D(O(J(Kevin)))),B(E(Charlie),F(K(M(Naomi))),G(L(N(Sam)))));"
plot(read.dendrogram(tree))

I'm doing this so later on I can do a distance matrix of the nodes of my outputted tree.

Would the function write.tree be able to analyze data like this and make a tree from this (assuming my actual dataset is much larger)? Or in general, a function that would output the tree format. Thanks


Solution

  • You can use the ape::read.tree() function to read your newick format tree

    tree = "(A(C(H(I(Andrew))),D(O(J(Kevin)))),B(E(Charlie),F(K(M(Naomi))),G(L(N(Sam)))));"
    my_tree <- read.tree(text = tree)
    plot(my_tree)
    

    You can then use ape::write.tree to save the tree into a newick file:

    write.tree(my_tree, file = "my_file_name.tre")
    

    To convert your table into a "phylo" object from ape you can use this function (that might need some adjustements):

    ## The function
    data.frame.to.phylo <- function(sample){
        ## Making an edge table
        edge_table <- rbind(
            ## The root connecting A to B
            rbind(c("root", "A"),c("root", "B")),
            ## All the nodes connecting to the tips
            cbind(sample$stock, sample$name)
            )
    
        ## Translating the values in the edge table into edge IDs
        ## The order must be tips, root, nodes
        element_names <- c(unique(sample$name), "root", unique(sample$stock))
        element_ids   <- seq(1:length(element_names))
    
        ## Looping through each ID and name
        for(element in element_ids) {
            edge_table <- ifelse(edge_table == element_names[element], element_ids[element], edge_table)
        }
    
        ## Make numeric
        edge_table <- apply(edge_table, 2, as.numeric)
    
        ## Build the phylo object
        phylo_object <- list()
        phylo_object$edge <- edge_table
        phylo_object$tip.label <- unique(sample$name)
        phylo_object$node.label <- c("root", unique(sample$stock))
        phylo_object$Nnode <- length(phylo_object$node.label)
    
        ## Forcing the class to be "phylo"
        class(phylo_object) <- "phylo"
        return(phylo_object)
    }
    
    ## The data
    sample = data.frame("stock" = c("A", "A", "B", "B", "B"), "mbranch" = c("C", "D", "E", "F", "G"), "sbranch" = c("H", "O", NA, "K", "L"), "lsbranch" = c("I", "J", NA, "M", "N"), "name" = c("Andrea", "Kevin", "Charlie", "Naomi", "Sam"))
    
    ## Plotting the data.frame for testing the function
    plot(data.frame.to.phylo(sample))
    

    Cheers, Thomas