python, scikit-learn, cluster-analysis, k-means, word2vec

How to check the cluster details of a given vector in k-means in sklearn


I am using the following code to cluster my word vectors with the k-means clustering algorithm.

from gensim.models import word2vec   # needed for word2vec.Word2Vec.load
from sklearn import cluster

model = word2vec.Word2Vec.load("word2vec_model")
X = model[model.wv.vocab]              # one row per vocabulary word, in vocab order
clusterer = cluster.KMeans(n_clusters=6)
preds = clusterer.fit_predict(X)
centers = clusterer.cluster_centers_

Given a word in the word2vec vocabulary (e.g., word_vector = model['jeep']), I want to get its cluster ID and its cosine distance to its cluster center.

I tried the following approach.

import numpy as np

for cluster_id in set(preds):
    positions = X[np.where(preds == cluster_id)]   # every vector assigned to this cluster
    print(positions)

However, it returns all of the vectors in each cluster, not the cluster ID and distance for a single word, which is what I am looking for.
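
To illustrate the problem (a small sketch, reusing the variables from the snippets above): preds is a 1-D array with one cluster label per row of X, so filtering on a label collects every member vector of that cluster rather than the assignment of one particular word.

import numpy as np

print(preds.shape)                    # (len(X),) -- one label per word vector
print(np.unique(preds))               # array([0, 1, 2, 3, 4, 5])
print(X[np.where(preds == 0)].shape)  # (n_members_of_cluster_0, vector_dim)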

I am happy to provide more details if needed.


Solution

  • After clustering, you get labels_ for all of your input data (in the same order as the input), i.e. clusterer.labels_[model.wv.vocab['jeep'].index] gives you the cluster to which jeep belongs.

    You can calculate the cosine distance with scipy.spatial.distance.cosine

    cluster_index = clusterer.labels_[model.wv.vocab['jeep'].index]
    print(distance.cosine(model['jeep'], centers[cluster_index]))
    >> 0.6935321390628815
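
    If you want to double-check that number, distance.cosine is simply 1 minus the cosine similarity; here is a quick sanity check with plain NumPy (a sketch, reusing the variables above):

    import numpy as np

    v = model['jeep']
    c = centers[cluster_index]
    # cosine distance = 1 - (v . c) / (||v|| * ||c||)
    print(1.0 - np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)))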
    

    Full code

    I don't know what your model looks like, but let's use GoogleNews-vectors-negative300.bin.

    from gensim.models import KeyedVectors
    from sklearn import cluster
    from scipy.spatial import distance
    
    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    
    # let's use a subset to accelerate clustering
    # (the lookup below assumes 'jeep' is among these first 40,000 words)
    X = model[model.wv.vocab][:40000]
    
    clusterer = cluster.KMeans(n_clusters=6)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_
    
    cluster_index = clusterer.labels_[model.wv.vocab['jeep'].index]
    print(cluster_index, distance.cosine(model['jeep'], centers[cluster_index]))
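
    One caveat with the subset: the lookup is only valid if the word's vocabulary index falls inside the first 40,000 rows that were clustered. A small helper along those lines (cluster_info is just an illustrative name, not part of the original answer):

    def cluster_info(word, n_subset=40000):
        """Return (cluster id, cosine distance to its center), or None
        if the word was not part of the clustered subset."""
        idx = model.wv.vocab[word].index
        if idx >= n_subset:
            return None  # word was not clustered
        cid = clusterer.labels_[idx]
        return cid, distance.cosine(model[word], centers[cid])

    print(cluster_info('jeep'))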