python scipy hierarchical-clustering phylogeny

Newick tree representation to scipy.cluster.hierarchy linkage matrix format

I have a set of genes which have been aligned and clustered based on DNA sequences, and I have this set of genes in a Newick tree representation (https://en.wikipedia.org/wiki/Newick_format). Does anyone know how to convert this format to the scipy.cluster.hierarchy.linkage matrix format? From the scipy docs for the linkage matrix:

A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

At least from the scipy docs, their description of how this linkage matrix is structured is rather confusing. What do they mean by "iteration"? Also, how does this representation keep track of which original observations are in which cluster?

I would like to figure out how to do this conversion as the results of other cluster analyses in my project have been done with the scipy representation, and I've been using it for plotting purposes.

Solution

I got how the linkage matrix is generated from the tree representation, thanks @cel for clarification. Let's take the example from the Newick wiki page (https://en.wikipedia.org/wiki/Newick_format)

The tree, in string format is:

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

First, one should compute the distances between all of the leaves. If for example, we wish to compute the distance A and B, the method is to traverse the tree from A to B through the nearest branch. Since in the Newick format, we are given the distance between each leaf and the branch, the distance from A to B is simply 0.1 + 0.2 = 0.3. For A to D, we would have to do 0.1 + (0.5 + 0.4) = 1.0, since the distance from D to the nearest branch is given as 0.4, and the distance from D's branch to A's is 0.5. Thus the distance matrix looks like this (with indexing A=0, B=1, C=2, D=3):

distance_matrix=
 [[0.0, 0.3, 0.9, 1.0],
  [0.3, 0.0, 1.0, 1.1],
  [0.9, 1.0, 0.0, 0.7],
  [1.0, 1.1, 0.1, 0.0]]

From here, the linkage matrix is easy to find. Since we already have n=4 clusters (A,B,C,D) as original observations, we need to find the additional n-1 clusters of the tree. Each step simply combines two clusters into a new one, and we take the two clusters that are closest to each other. In this case, A and B are closest together, so the first row of the linkage matrix will look like this:

[A,B,0.3,2]

From now on, we treat A & B as one cluster whose distance to the nearest branch is the distance between A & B.

Now we have 3 clusters left, AB, C, and D. We can update the distance matrix to see which clusters are closest together. Let AB have index 0 in the updated distance matrix.

distance_matrix=
[[0.0, 1.1, 1.2],
 [1.1, 0.0, 0.7],
 [1.2, 0.7, 0.0]]

We can now see that C and D are closest to each other, so let's combine them into a new cluster. The second row in the linkage matrix will now be

[C,D,0.7,2]

Now, we only have two clusters left, AB and CD. The distance from these clusters to the root branch is 0.3 and 0.7 respectively, so their distance is 1.0. The final row of the linkage matrix will be:

[AB,CD,1.0,4]

Now, the scipy matrix wouldn't actually have the strings in place as I've shown here, we would have the use the indexing scheme, since we combined A and B first, AB would have index 4 and CD would have index 5. So the actual result we should see in the scipy linkage matrix would be:

[[0,1,0.3,2],
 [2,3,0.7,2],
 [4,5,1.0,4]]

This is the general way to get from the tree representation to the scipy linkage matrix representation. However, there already exist tools from other python packages to read in trees in Newick format, and from these, we can fairly easily find the distance matrix, and then pass that to scipy's linkage function. Below is a little script that does exactly that for this example.

from ete2 import ClusterTree, TreeStyle
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance
import matplotlib.pyplot as plt
import numpy as np
from itertools import combinations


tree = ClusterTree('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);')
leaves = tree.get_leaf_names()
ts = TreeStyle()
ts.show_leaf_name=True
ts.show_branch_length=True
ts.show_branch_support=True

idx_dict = {'A':0,'B':1,'C':2,'D':3}
idx_labels = [idx_dict.keys()[idx_dict.values().index(i)] for i in range(0, len(idx_dict))]

#just going through the construction in my head, this is what we should get in the end
my_link = [[0,1,0.3,2],
        [2,3,0.7,2],
        [4,5,1.0,4]]

my_link = np.array(my_link)


dmat = np.zeros((4,4))

for l1,l2 in combinations(leaves,2):
    d = tree.get_distance(l1,l2)
    dmat[idx_dict[l1],idx_dict[l2]] = dmat[idx_dict[l2],idx_dict[l1]] = d

print 'Distance:'
print dmat


schlink = sch.linkage(scipy.spatial.distance.squareform(dmat),method='average',metric='euclidean')

print 'Linkage from scipy:'
print schlink

print 'My link:'
print my_link

print 'Did it right?: ', schlink == my_link

dendro = sch.dendrogram(my_link,labels=idx_labels)
plt.show()

tree.show(tree_style=ts)