
Partially define initial centroids for scikit-learn K-Means clustering


The scikit-learn documentation for the init parameter of KMeans states that:

Method for initialization:

‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, while the 7th dimension is initialized freely by k-means++. (In other words, I do not want to specify the full initial centroids, but rather fix 6 dimensions and leave only one dimension free to vary in the initial clusters.)

I tried passing a 10x6 array in the hope that it would work, but it just throws an error.
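
For reference, the failure can be reproduced with a minimal sketch like the one below. The data and the partial centroids are random placeholders (an assumption, not the real dataset), and the exact error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 7))            # placeholder data: 1000 samples, 7 features
partial_init = rng.standard_normal((10, 6))   # only 6 of the 7 centroid coordinates

try:
    # init has 6 columns but X has 7 features, so fitting is rejected
    KMeans(n_clusters=10, init=partial_init, n_init=1).fit(X)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```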


Solution

  • scikit-learn does not support this kind of fine-grained control: an init array must specify all 7 coordinates of every centroid.

    The only option is to supply a value for the 7th feature yourself, either a random one or one close to what k-means++ would have picked.

    You can estimate a reasonable value as follows:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    nb_clust = 10
    # your data (random placeholder: 1000 samples, 7 features)
    X = np.random.randn(1000, 7)

    # your partial centroids: 6 known coordinates per cluster
    cent_6cols = np.random.randn(nb_clust, 6)

    # assign each point to its nearest partial centroid, using only the
    # first 6 features (this avoids setting cluster_centers_ by hand on an
    # unfitted KMeans object, which predict() may reject)
    initial_prediction = pairwise_distances_argmin(X[:, 0:6], cent_6cols)

    # for the 7th column, use the average 7th-feature value of the points
    # assigned to each partial centroid
    cent_7cols = np.zeros((nb_clust, 7))
    cent_7cols[:, 0:6] = cent_6cols
    for i in range(nb_clust):
        members = initial_prediction == i
        # fall back to the global mean if no point was assigned to cluster i
        cent_7cols[i, 6] = X[members, 6].mean() if members.any() else X[:, 6].mean()

    # the 7th column is now initialized from the data, so cent_7cols can be
    # used as the initial centroids; n_init=1 because an explicit init array
    # makes repeated initializations redundant
    truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
    truekm.fit(X)
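
One caveat worth checking: init only sets the starting point, so the 6 predefined coordinates are not held fixed during fitting. The sketch below uses random placeholder data and a random 10x7 array standing in for the completed centroids (both assumptions) to measure how far the fitted centroids move from their initial values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 7))             # placeholder data
init_centers = rng.standard_normal((10, 7))    # stand-in for the completed 10x7 centroids

km = KMeans(n_clusters=10, init=init_centers, n_init=1).fit(X)

# Absolute per-coordinate movement of each centroid from its starting point
drift = np.abs(km.cluster_centers_ - init_centers)
print(drift[:, :6].max())   # movement within the 6 "predefined" coordinates
```

If those 6 coordinates must stay exactly as given, a plain KMeans fit is probably not the right tool, and a custom update loop would be needed instead.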