Search code examples
pythoncluster-analysisk-meanskdtree

Is kdtree used for speeding k-means clustering or not?


I am doing a project by using k-means and my professor suggested kdtree. I found this implementation of kdtree in python (i know that there is also in scipy, but i couldn't find any sample implementation). My question is the same as the tittle, is it used kdtree for speeding up k-means, or i am wrong?

data = [(2,2),(1,0),(2,3),(10,5),(59,8),(4,2)]

tree = KDTree.construct_from_data(data)
nearest = tree.query(query_point=(5,4), t=3)
print nearest

output:

[(4, 2), (2, 3), (2, 2)]

Solution

  • As "Making k-means even faster", p 137, paper shows, kd-tree can be used for k means algorithm speed up for low-dimensional data, while straightforward Lloyd's algorithm is more efficient for higher dimension.

    With high-dimensional data, indexing schemes such as k-d tree do not work well

    See explanations in paper.

    I suggest you to use one of established k-means implementation and worry about speed improvement only if you run into severe issues. For example, afaik, sklearn's KMeans is based on Lloyd's original algorithm.