python machine-learning scikit-learn cluster-analysis k-means

Partially define initial centroids for scikit-learn K-Means clustering

The scikit-learn documentation states:

Method for initialization:

‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

My data has 10 (predicted) clusters and 7 features. However, I would like to pass an array of shape 10 by 6, i.e. I want 6 dimensions of each centroid to be predefined by me, while the 7th dimension is chosen freely by k-means++. (In other words, I do not want to specify the full initial centroids, but rather fix 6 dimensions and leave only one dimension free to vary for the initial clusters.)

I tried passing a 10x6 array in the hope that it would work, but it just throws an error.

Solution

Sklearn does not support this kind of fine-grained control over initialization.

The only option is to supply the 7th feature value yourself, either a random one or one close to what k-means++ would have picked.
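For the random option, a minimal sketch might look like the following. It assumes the 7th coordinate is drawn from values actually observed in the 7th feature, which is only one possible choice; the data, the partial centroids, and the names `seventh` and `init_centers` are placeholders, not part of the question.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 7))         # placeholder data
cent_6cols = rng.standard_normal((10, 6))  # placeholder partial centroids

# Assumption: take the 7th coordinate from observed values of the 7th feature,
# so the starting centroids stay inside the range of the data.
seventh = rng.choice(X[:, 6], size=10, replace=False)
init_centers = np.column_stack([cent_6cols, seventh])  # shape (10, 7)

# n_init=1 because the initial centers are given explicitly
km = KMeans(n_clusters=10, init=init_centers, n_init=1).fit(X)
```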

A reasonable value for the 7th feature can be estimated from the data as follows:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

nb_clust = 10
# your data
X = np.random.randn(1000, 7)

# your 6-column centroids
cent_6cols = np.random.randn(nb_clust, 6)

# assign each point to its nearest partial centroid, using only the
# first 6 features (this avoids faking a fitted KMeans object by setting
# cluster_centers_ by hand, which predict() may reject)
initial_prediction = pairwise_distances_argmin(X[:, :6], cent_6cols)

# for the 7th column, use the average 7th-feature value of the points
# assigned to each partial centroid
cent_7cols = np.zeros((nb_clust, 7))
cent_7cols[:, :6] = cent_6cols
for i in range(nb_clust):
    members = initial_prediction == i
    # fall back to the global mean if no point was assigned to this centroid
    cent_7cols[i, 6] = X[members, 6].mean() if members.any() else X[:, 6].mean()

# the 7th column now has a data-driven starting value, so cent_7cols
# can be used as the initial centroids; n_init=1 because the init is fixed
truekm = KMeans(n_clusters=nb_clust, init=cent_7cols, n_init=1)
truekm.fit(X)
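
Keep in mind that `init` only sets the starting point: during fitting, K-Means updates every coordinate of every centroid, including the six you fixed. The sketch below wraps the approach in a hypothetical helper, `complete_centroids` (not a scikit-learn function), and measures how far the fitted centroids moved from your six predefined coordinates. The data and the partial centroids are random placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin


def complete_centroids(X, partial_centers):
    """Fill in the trailing feature columns missing from partial_centers.

    Hypothetical helper: each point is assigned to its nearest partial
    centroid using the leading columns only, and the missing columns are
    set to the mean of the assigned points (global mean if none).
    """
    X = np.asarray(X, dtype=float)
    partial_centers = np.asarray(partial_centers, dtype=float)
    n_known = partial_centers.shape[1]
    labels = pairwise_distances_argmin(X[:, :n_known], partial_centers)

    full = np.empty((partial_centers.shape[0], X.shape[1]))
    full[:, :n_known] = partial_centers
    full[:, n_known:] = X[:, n_known:].mean(axis=0)  # default for empty groups
    for i in range(partial_centers.shape[0]):
        members = labels == i
        if members.any():
            full[i, n_known:] = X[members, n_known:].mean(axis=0)
    return full


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 7))      # placeholder data
partial = rng.standard_normal((10, 6))  # placeholder partial centroids

init_centers = complete_centroids(X, partial)
km = KMeans(n_clusters=10, init=init_centers, n_init=1).fit(X)

# distance each fitted centroid moved away from its predefined 6 coordinates
drift = np.linalg.norm(km.cluster_centers_[:, :6] - partial, axis=1)
print(drift)
```

If the drift is large, the predefined coordinates had little influence on the final solution, which is expected behaviour for K-Means rather than a sign that the initialization failed.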