
Python make newick format using dataframe with 0s and 1s

I have a dataframe like this

      a b c d e f g h i j k l m
mut1  0 0 0 0 0 1 1 1 1 1 1 1 1 
mut2  0 0 0 0 0 1 1 1 1 1 0 0 0 
mut3  0 0 0 0 0 1 1 0 0 0 0 0 0 
mut4  0 0 0 0 0 1 0 0 0 0 0 0 0 
mut5  0 0 0 0 0 0 0 1 1 0 0 0 0 
mut6  0 0 0 0 0 0 0 1 0 0 0 0 0 
mut7  0 0 0 0 0 0 0 0 0 1 0 0 0 
mut8  0 0 0 0 0 0 0 0 0 0 1 1 1 
mut9  0 0 0 0 0 0 0 0 0 0 1 1 0 
mut10 0 0 0 0 0 0 0 0 0 0 0 0 1 
mut11 1 1 1 1 1 0 0 0 0 0 0 0 0 
mut12 1 1 1 0 0 0 0 0 0 0 0 0 0 
mut13 1 1 0 0 0 0 0 0 0 0 0 0 0 
mut14 1 0 0 0 0 0 0 0 0 0 0 0 0 
mut15 0 0 0 1 0 0 0 0 0 0 0 0 0 
mut16 0 0 0 0 1 0 0 0 0 0 0 0 0 

and the original corresponding string


The algorithm I have in mind is as follows.

In row mut1, we can see that f,g,h,i,j,k,l,m have the same features. So the string can be modified into


In row mut2, we can see that f,g,h,i,j have the same features. So the string can be modified into


Until mut10, it continues to cluster samples in f,g,h,i,j,k,l,m.

And the output will be


(For a row with one "1", just skip the process)

From mut11, it starts to cluster samples a,b,c,d,e

and similarly, the final output will be


So the algorithm is

  1. Cluster the samples with the same features.
  2. After clustering, add ":0" behind the closing parenthesis.
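The two steps above can be sketched in plain Python. This is only one possible reading of the algorithm, not a definitive implementation: the function name `matrix_to_newick` is hypothetical, rows with fewer than two 1s (and rows that are all 1s) are skipped, leaf labels are written bare, and ":0" is appended only after the closing parenthesis of each nested cluster, not after the outermost one. It also assumes that any two rows are either nested or disjoint and raises an error otherwise. With pandas, the inputs could be taken from `list(df.columns)` and `df.to_numpy().tolist()`.

```python
def matrix_to_newick(samples, rows):
    """Nest the clusters defined by the rows of a 0/1 matrix into a Newick-like string."""
    position = {name: i for i, name in enumerate(samples)}

    # One candidate cluster per row: the samples marked 1.
    # Singletons, empty rows and rows covering every sample add no grouping.
    clusters = {frozenset(s for s, bit in zip(samples, row) if int(bit)) for row in rows}
    clusters = sorted((c for c in clusters if 1 < len(c) < len(samples)),
                      key=len, reverse=True)

    root = {"members": frozenset(samples), "children": []}
    for cluster in clusters:  # largest first, so parents exist before children
        node = root
        while True:
            # Descend into the child that fully contains this cluster, if any.
            inner = next((ch for ch in node["children"] if cluster <= ch["members"]), None)
            if inner is None:
                break
            node = inner
        if any(cluster & ch["members"] for ch in node["children"]):
            raise ValueError(f"cluster {sorted(cluster)} overlaps another cluster")
        node["children"].append({"members": cluster, "children": []})

    def render(node, is_root=False):
        covered = set().union(*(ch["members"] for ch in node["children"]))
        # Keep the original left-to-right sample order in the output.
        parts = [(min(position[s] for s in ch["members"]), render(ch))
                 for ch in node["children"]]
        parts += [(position[s], s) for s in node["members"] - covered]
        body = ",".join(text for _, text in sorted(parts))
        return f"({body})" if is_root else f"({body}):0"

    return render(root, is_root=True)


samples = list("abcdefghijklm")
table = """
0000011111111 0000011111000 0000011000000 0000010000000
0000000110000 0000000100000 0000000001000 0000000000111
0000000000110 0000000000001 1111100000000 1110000000000
1100000000000 1000000000000 0001000000000 0000100000000
""".split()  # mut1 .. mut16, one string of 0/1 flags per row
rows = [[int(ch) for ch in line] for line in table]

newick = matrix_to_newick(samples, rows)
print(newick)
# ((((a,b):0,c):0,d,e):0,(((f,g):0,(h,i):0,j):0,((k,l):0,m):0):0)
```

Processing the clusters from largest to smallest means every cluster only has to find the smallest already-placed cluster that contains it, which avoids editing the string in place row by row.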

Any suggestions on this process?

*P.S. I have posted a similar question, "Creating a newick format from dataframe with 0 and 1", but this one is more detailed.


  • Your question asks for a solution in Python, which I'm not familiar with. Hopefully, the following procedure in R will be helpful as well.

    What your question describes is a matrix representation of a tree. Such a tree can be retrieved from the matrix with a maximum parsimony method using the phangorn package. To manipulate trees in R, the Newick format is useful. Newick differs from the tree representation in your question by ending with a semicolon.
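On the Python side, which the question asks about, one way to check whether a 0/1 matrix can be read as a tree without any conflict is to test that every pair of rows marks groups of samples that are either nested or disjoint. The sketch below is an illustration under that assumption; the function name `rows_are_nested_or_disjoint` is hypothetical and not part of any library.

```python
from itertools import combinations


def rows_are_nested_or_disjoint(rows):
    """Return True if every pair of rows marks nested or non-overlapping column sets."""
    groups = [{i for i, bit in enumerate(row) if int(bit)} for row in rows]
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(groups, 2))


# The second row is contained in the first, the third is disjoint from both.
compatible = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
# Columns 0-1 and columns 1-2 overlap only partially, so no single tree fits both.
conflicting = [[1, 1, 0, 0], [0, 1, 1, 0]]

print(rows_are_nested_or_disjoint(compatible))   # True
print(rows_are_nested_or_disjoint(conflicting))  # False
```

When this check fails, the rows contradict each other and a method such as maximum parsimony has to pick a best-fitting tree rather than read the tree off directly.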

    First, prepare a starting tree in phylo format.

    library(phangorn)  # also attaches ape, which provides read.tree() and write.tree()
    tree0 <- read.tree(text = "(a,b,c,d,e,f,g,h,i,j,k,l,m);")

    Second, convert your data.frame to a phyDat object, in which the rows represent samples and the columns represent features, hence the transpose. The phyDat object also needs to be told which levels are present in the data, which are 0 and 1 in this case. Combining the starting tree with the data, we calculate the maximum parsimony tree.

    dat0 <- read.table(text = "      a b c d e f g h i j k l m
        mut1  0 0 0 0 0 1 1 1 1 1 1 1 1 
        mut2  0 0 0 0 0 1 1 1 1 1 0 0 0 
        mut3  0 0 0 0 0 1 1 0 0 0 0 0 0 
        mut4  0 0 0 0 0 1 0 0 0 0 0 0 0 
        mut5  0 0 0 0 0 0 0 1 1 0 0 0 0 
        mut6  0 0 0 0 0 0 0 1 0 0 0 0 0 
        mut7  0 0 0 0 0 0 0 0 0 1 0 0 0 
        mut8  0 0 0 0 0 0 0 0 0 0 1 1 1 
        mut9  0 0 0 0 0 0 0 0 0 0 1 1 0 
        mut10 0 0 0 0 0 0 0 0 0 0 0 0 1 
        mut11 1 1 1 1 1 0 0 0 0 0 0 0 0 
        mut12 1 1 1 0 0 0 0 0 0 0 0 0 0 
        mut13 1 1 0 0 0 0 0 0 0 0 0 0 0 
        mut14 1 0 0 0 0 0 0 0 0 0 0 0 0 
        mut15 0 0 0 1 0 0 0 0 0 0 0 0 0 
        mut16 0 0 0 0 1 0 0 0 0 0 0 0 0")
    dat1 <- phyDat(data = t(dat0), 
        type = "USER",
        levels = c(0, 1))
    tree1 <- optim.parsimony(tree = tree0, data = dat1)


    The tree now contains a cladogram with no branch lengths. Class phylo is effectively a list, so the zero branch lengths can be added as an extra element.

    tree2 <- tree1
    tree2$edge.length <- rep(0, nrow(tree2$edge))

    Last, we write the tree into a character vector in newick format and remove the semicolon at the end to match the requirement.

    tree3 <- write.tree(tree2)
    tree3 <- sub(";", "", tree3)
    # [1] "((e:0,d:0):0,(c:0,(b:0,a:0):0):0,((m:0,(l:0,k:0):0):0,((i:0,h:0):0,j:0,(g:0,f:0):0):0):0)"