Tags: data-science, cluster-analysis, k-means, word2vec

How to see which words the clusters were based on


I'm using the code below to cluster my documents; the plotting code at the end produces a 2-D scatter plot of the clusters. I'm trying to find a way of printing out the most common words on which each cluster was based. Is that possible with gensim Doc2Vec?

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Cluster the Doc2Vec document vectors into 3 groups
kmeans_model = KMeans(n_clusters=3, init='k-means++', max_iter=100)
kmeans_model.fit(d2v_model.dv.vectors)
labels = kmeans_model.labels_.tolist()

# Project the document vectors to 2-D for plotting
pca = PCA(n_components=2).fit(d2v_model.dv.vectors)
datapoint = pca.transform(d2v_model.dv.vectors)

# Attach each document's cluster label to the dataframe
data['cluster'] = kmeans_model.labels_
clusters = data.groupby('cluster')


# Write each cluster's documents to its own CSV file
for cluster in clusters.groups:
    with open('cluster' + str(cluster) + '.csv', 'w', encoding="utf-8", newline='') as f:
        # Keep only the relevant columns and strip embedded line breaks/tabs
        cluster_data = clusters.get_group(cluster)[['NR_SOLICITACAO', 'DS_ANALISE', 'PRE_PROCESSED']]
        cluster_data = cluster_data.replace(r'\r+|\n+|\t+', ' ', regex=True)
        f.write(cluster_data.to_csv(index_label='id'))  # set index to id

# Plot the PCA-projected documents coloured by cluster, plus the cluster centroids
import matplotlib.pyplot as plt
label1 = ["#0000FF", "#006400", "#FFFF00", "#CD5C5C", "#FF0000", "#FF1493"]
color = [label1[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker='^', s=150, c='#000000')
plt.show()

Solution

  • You seem to be using a Gensim Doc2Vec model (the 'Paragraph Vectors' algorithm), and your code shows the clusters being calculated from the .dv property. Those are the full-document vectors, rather than individual word-vectors.

    The clusters so derived won't have the same sort of directly reportable relationship to individual words that (for example) LDA topics might have. There, the LDA model itself can directly report the topics implied by a word, or the words most indicative of a topic.

    But there are indirect ways you can probe for similar relationships between individual word-tokens and your groupings. Whether any of them gives acceptable results for your needs, you'll have to try & evaluate. Some possibilities:

    1. Survey the sets-of-documents directly: You could simply tally up all the words in each of your clusters, individually, as if the entire cluster were one document. Then pick the words that are most distinct, by some measure, in each cluster. This could be as simple as ordinally ranking all words by the number of times they appear in each cluster, & reporting the words whose rankings for one cluster are the most positions 'higher' than the same word's rankings in the other(s), as in the sketch below. Or, you could use another measure of term-importance, like TF-IDF.
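
    For instance, here's a minimal sketch of the tally-and-rank idea, assuming data['PRE_PROCESSED'] holds whitespace-separable pre-processed text and data['cluster'] holds the KMeans labels from the question's code. The distinctive_words helper and the rank-difference score are just one illustrative choice, not the only way to do it:

    from collections import Counter

    # Count word occurrences separately inside each cluster
    cluster_counts = {
        c: Counter(word for doc in group['PRE_PROCESSED'] for word in doc.split())
        for c, group in data.groupby('cluster')
    }

    # Within each cluster, rank words by frequency (rank 0 = most frequent)
    cluster_ranks = {
        c: {w: rank for rank, (w, _) in enumerate(counts.most_common())}
        for c, counts in cluster_counts.items()
    }

    def distinctive_words(target, others, topn=10):
        # Words whose frequency rank in `target` is furthest ahead of their best rank elsewhere
        scores = {}
        for w, r in cluster_ranks[target].items():
            best_elsewhere = min(cluster_ranks[o].get(w, len(cluster_ranks[o])) for o in others)
            scores[w] = best_elsewhere - r   # large positive => characteristic of `target`
        return sorted(scores, key=scores.get, reverse=True)[:topn]

    for c in cluster_ranks:
        print(c, distinctive_words(c, [o for o in cluster_ranks if o != c]))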

    2. Doc-vectors to word-vector correlations: If you've used a Doc2Vec mode that also trains word-vectors at the same time – such as the PV-DM mode (dm=1), or PV-DBOW (dm=0) while also adding the non-default dbow_words=1 argument – then the model's d2v_model.wv property will have valid word-vectors, in a compatible coordinate space.

    Thus you can look for words that are similar-to either the centroid points of your clusters, or similar-to a larger subset of each cluster (perhaps even every document), to get a sampling of words-that-may-be-descriptive.

    You do this by performing a .most_similar() on the .wv set-of-word-vectors, using doc-vectors (whether for a full document, or centroids of a cluster) as the positive example(s). EG:

    words_near_one_doc = d2v_model.wv.most_similar(positive=[doc_vec1])
    

    ...or...

    words_near_centroid_of_3_docs = d2v_model.wv.most_similar(positive=[doc_vec1, doc_vec2, doc_vec3])
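
    Since the question's KMeans model was fit on d2v_model.dv.vectors, its cluster_centers_ are already in that same coordinate space, so – assuming the model was trained in a mode with valid word-vectors, per the caveat above – a rough sketch of the centroid variant would be:

    for cluster_id, centroid in enumerate(kmeans_model.cluster_centers_):
        # Gensim's KeyedVectors.most_similar() accepts raw vectors as positive examples
        print(cluster_id, d2v_model.wv.most_similar(positive=[centroid], topn=10))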
    
    3. Probe with synthetic documents, such as one-word documents, or real documents perturbed by word additions/removals: You could go through your vocabulary and use the Doc2Vec model to infer new doc-vectors for synthetic documents of just a single word each. (Consider using far more epochs than usual, since such a tiny document lacks the length that would normally drive lots of training before the vector settles, and keep in mind that this unnatural process, without the normal variety of a real document, might generate weird results.) See where those degenerate documents land among your pre-existing clusters, and consider the words whose documents land in a cluster as candidate labels for it, as in the sketch below. Or similarly, take existing documents – perhaps those especially near the boundaries between clusters – and try adding/removing words from them, then re-infer new vectors. See which words most strongly 'move' a document towards or away from your existing clusters, & consider using those words as positive or negative labels.
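
    A minimal sketch of the one-word-document probe, assuming the d2v_model and fitted kmeans_model from the question's code. The epochs=100 value is only an illustrative stand-in for "far more epochs", and no filtering or sorting of the resulting words is attempted:

    from collections import defaultdict

    words_per_cluster = defaultdict(list)
    for word in d2v_model.wv.index_to_key:
        # Infer a vector for a degenerate single-word 'document', with extra epochs
        vec = d2v_model.infer_vector([word], epochs=100)
        cluster_id = kmeans_model.predict([vec])[0]
        words_per_cluster[cluster_id].append(word)

    # Words whose one-word documents land in each cluster (a starting point, not a ranking)
    for cluster_id, words in sorted(words_per_cluster.items()):
        print(cluster_id, words[:20])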