python-2.7 scikit-learn cluster-analysis kdtree

Clustering 2D points using sklearn KDTree

I have an array of shape (n_samples, 2) and I want to cluster its points using sklearn.neighbors.KDTree.

I have this sample piece of code:

from sklearn.neighbors import KDTree
import numpy as np
np.random.seed(0)
X = np.random.random((10, 2))
tree = KDTree(X, leaf_size=2)

Now I want to extract the points in the leaves of the tree so that each leaf can be a cluster. Points that are in the same leaf belong to the same cluster.

In the above example because the maximum leaf_size is 2, we'll have about 10 / 2 = 5 clusters.

What I want is, given a point in X (e.g. X[0]), for the tree to give me the index of the leaf that the point belongs to.

Solution

A maximum leaf size of 2 means each leaf holds one or two points, so for n points you can have anywhere from n/2 to n leaves. But you forgot about the non-leaf nodes.

For your 10 points, a kd-tree that stores one point in each internal node will have 1 point in the root, 2 in the second layer (and those two are not close to each other), and then 4 leaf nodes holding the remaining 7 points. So by looking at the leaves only, you lose three points.

A kd-tree does not attempt to cluster points. It's perfectly valid for a kd-tree to have the exact same coordinates in two nodes! The reference you gave used the kd-tree solely to get an adaptive grid. I don't think it is a very good approach, but it is very easy. You should just implement it yourself, so you don't build the full tree, and don't put objects into non-leaf nodes.
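To make the "implement it yourself" suggestion concrete, here is a minimal sketch of one way to do it, using only NumPy. It splits the points at the median along alternating axes and stops as soon as a cell holds at most leaf_size points, so every point ends up in exactly one leaf and nothing is stored in internal nodes. The function name kd_leaf_labels and the choice of a plain median split are my own assumptions for illustration; this is not scikit-learn's KDTree and will not necessarily reproduce its partition.

```python
import numpy as np


def kd_leaf_labels(X, leaf_size=2):
    """Label each row of X with the id of its cell in a median-split grid.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    X = np.asarray(X, dtype=float)
    labels = np.empty(len(X), dtype=int)
    next_label = 0
    # Each stack entry is (indices of the points in this cell, depth).
    stack = [(np.arange(len(X)), 0)]
    while stack:
        idx, depth = stack.pop()
        if len(idx) <= leaf_size:
            # Small enough: this cell becomes a leaf, i.e. one cluster.
            labels[idx] = next_label
            next_label += 1
            continue
        # Cycle through the axes, as a kd-tree does.
        axis = depth % X.shape[1]
        # Sort the cell's points along that axis and cut at the middle.
        order = idx[np.argsort(X[idx, axis], kind="mergesort")]
        mid = len(order) // 2
        stack.append((order[mid:], depth + 1))
        stack.append((order[:mid], depth + 1))
    return labels


np.random.seed(0)
X = np.random.random((10, 2))
labels = kd_leaf_labels(X, leaf_size=2)
print(labels)  # labels[i] is the leaf id of X[i]
```

With this, labels[0] is the leaf that X[0] belongs to, and X[labels == k] gives all points in leaf k. Note that for 10 points and leaf_size=2 the cells hold one or two points each, so you get six cells rather than exactly five.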