Tags: python, scikit-learn, hierarchical-clustering

scikit-learn agglomerative clustering error


I am trying to do agglomerative clustering with sklearn. The fitting step raises the error below. It does not show up every time: if I change the number of data points, the error may disappear and the clustering completes. I'm not sure how to debug this. I have already made sure there are no NaN values in my data array with fillnan. Any ideas about why this might be happening would be helpful.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-8acbe956f76e> in <module>()
     13     agg = AgglomerativeClustering(n_clusters=k,affinity="euclidean",linkage="ward")
     14     init = time.time()
---> 15     agg.fit(data)
     16     atime = time.time()
     17     labels = agg.labels_

C:\Python27\lib\site-packages\sklearn\cluster\hierarchical.pyc in fit(self, X, y)
    754                                        n_components=self.n_components,
    755                                        n_clusters=n_clusters,
--> 756                                        **kwargs)
    757         # Cut the tree
    758         if compute_full_tree:

C:\Python27\lib\site-packages\sklearn\externals\joblib\memory.pyc in __call__(self, *args, **kwargs)
    279 
    280     def __call__(self, *args, **kwargs):
--> 281         return self.func(*args, **kwargs)
    282 
    283     def call_and_shelve(self, *args, **kwargs):

C:\Python27\lib\site-packages\sklearn\cluster\hierarchical.pyc in ward_tree(X, connectivity, n_components, n_clusters, return_distance)
    189                           'for the specified number of clusters',
    190                           stacklevel=2)
--> 191         out = hierarchy.ward(X)
    192         children_ = out[:, :2].astype(np.intp)
    193 

C:\Python27\lib\site-packages\scipy\cluster\hierarchy.pyc in ward(y)
    463 
    464     """
--> 465     return linkage(y, method='ward', metric='euclidean')
    466 
    467 

C:\Python27\lib\site-packages\scipy\cluster\hierarchy.pyc in linkage(y, method, metric)
    662             Z = np.zeros((n - 1, 4))
    663             _hierarchy.linkage(dm, Z, n,
--> 664                                int(_cpy_euclid_methods[method]))
    665     return Z
    666 

scipy\cluster\_hierarchy.pyx in scipy.cluster._hierarchy.linkage (scipy\cluster\_hierarchy.c:8759)()

C:\Python27\lib\site-packages\scipy\cluster\_hierarchy.pyd in View.MemoryView.memoryview_copy_contents (scipy\cluster\_hierarchy.c:22026)()

C:\Python27\lib\site-packages\scipy\cluster\_hierarchy.pyd in View.MemoryView._err_extents (scipy\cluster\_hierarchy.c:21598)()

ValueError: got differing extents in dimension 0 (got 704882705 and 4999850001)

Solution

  • This is an integer overflow: note that 4999850001 - 2**32 = 704882705 (both numbers appear in the last line of your traceback). An array extent is too large to fit in a 32-bit integer, so it wraps around. You should try using fewer data points.
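
The arithmetic above can be checked with a short sketch in plain Python. It also rests on one assumption that the traceback does not state: that the larger extent is the number of pairwise distances between the input points, n * (n - 1) / 2. Under that assumption, 4999850001 corresponds to 99999 data points. The helper name `condensed_size` is made up for illustration.

```python
# Sanity check of the two numbers in the ValueError (no SciPy required).

def condensed_size(n):
    """Number of pairwise distances between n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2

reported = 4999850001   # larger extent from the error message
truncated = 704882705   # smaller extent from the error message

# Keeping only the low 32 bits of the larger value yields the smaller one.
assert reported % 2**32 == truncated

# Assumption: the extent is the pairwise-distance count for the input.
# With n = 99999 points the count matches the reported value exactly.
n_points = 99999
assert condensed_size(n_points) == reported

# Rough memory needed to hold that many float64 distances, in GiB.
print(condensed_size(n_points) * 8 / 2.0**30)
```

If that assumption holds, the pairwise-distance array alone would need tens of gigabytes, so reducing the number of points (for example by clustering a random subsample) may be necessary for memory reasons as well as to avoid the overflow.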