python matplotlib seaborn k-means tf-idf

Plot centroids in K-Means using TF-IDF


I'm grouping texts with KMeans and the clustering itself works well, but I can't plot the centroids together with the points. I don't know how to use matplotlib, only seaborn, with the matrix created by the TF-IDF vectorizer.

MiniBatchKMeans has the attribute cluster_centers_, but I can't work out how to use it in the plot.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
df_abstracts = df_cleared['abstract'].tolist() # list with 33,000 lines of strings

tfidf = TfidfVectorizer(max_features=2**12, ngram_range=(1,4), stop_words = 'english')
vextorized = tfidf.fit_transform(df_abstracts)

# For plotting, reduce the TF-IDF features (up to 2**12 = 4096 columns) to 9 principal components; only the first two are plotted.
from sklearn.decomposition import PCA
pca = PCA(n_components = 9)
X_pca = pca.fit_transform(vextorized.toarray())

from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(init='k-means++', n_clusters=4, max_iter=500, n_init=10, 
                         random_state=9)

y_pred = kmeans.fit_predict(vextorized)
np.unique(y_pred)

palette = sns.color_palette('bright', len(set(y_pred)))
# x and y are passed as keywords; recent seaborn releases reject them positionally
sns.scatterplot(x=X_pca[:,0], y=X_pca[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title('Clustered')

Solution

  • You ran k-means on the raw TF-IDF data, so the centers live in the original feature space. To plot them in the PCA space, you need to project them with the same fitted PCA.

    I use an example dataset:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA
    from sklearn.cluster import MiniBatchKMeans
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    categories = ['rec.sport.baseball', 'sci.electronics',
                  'comp.os.ms-windows.misc', 'talk.politics.misc']
    
    newsgroups = fetch_20newsgroups(subset='train',
                                          categories=categories)
    
    X_train = newsgroups.data
    y_train = newsgroups.target
    
    tfidf = TfidfVectorizer(max_features=2**12, ngram_range=(1,4), stop_words = 'english')
    vextorized = tfidf.fit_transform(X_train)
    

    When you perform the PCA, keep the fitted object so that it can be used later to project the k-means centers:

    pca = PCA(n_components = 9).fit(vextorized.toarray())
    X_pca = pca.transform(vextorized.toarray())
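
As a rough illustration of why the fitted object matters, the sketch below uses a small synthetic matrix (an assumption, standing in for the dense TF-IDF matrix) and checks that, with the default `whiten=False`, `pca.transform` amounts to subtracting the stored mean and multiplying by the stored principal axes. Any point in the original feature space, including a cluster center, can therefore be projected with it. The names `point`, `projected` and `manual` are only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the dense TF-IDF matrix (assumption: 200 rows, 6 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# Fit once and keep the object: it stores mean_ and components_
pca = PCA(n_components=2).fit(X)

# A point in the original 6-D space, e.g. the mean of some group of rows
point = X[:50].mean(axis=0, keepdims=True)
projected = pca.transform(point)

# With whiten=False (the default), transform is centering followed by
# a matrix product with the principal axes
manual = (point - pca.mean_) @ pca.components_.T
print(np.allclose(projected, manual))
```

A PCA that is refitted on different data would have a different mean and different axes, so the projected centers would no longer line up with the projected points.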
    

    This is what the data look like with the actual labels:

    labels = [newsgroups.target_names[i] for i in y_train]
    sns.scatterplot(x=X_pca[:,0], y=X_pca[:, 1], hue=labels, legend='full',palette="Set2")
    


    Now kmeans:

    kmeans = MiniBatchKMeans(init='k-means++', n_clusters=4, max_iter=500, n_init=10, 
                             random_state=777)
    y_pred = kmeans.fit_predict(vextorized)
    palette = sns.color_palette('bright', len(set(y_pred)))
    sns.scatterplot(x=X_pca[:,0], y=X_pca[:, 1], hue=y_pred, legend='full', palette=palette)
    plt.title('Clustered')
    

    We project the centers onto the principal components with the fitted PCA and plot the first 2:

    centers_on_PCs = pca.transform(kmeans.cluster_centers_)
    plt.scatter(x=centers_on_PCs[:,0],y=centers_on_PCs[:,1],s=200,c="k",marker="X")
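
Since the question mentions knowing seaborn but not matplotlib, here is a minimal end-to-end sketch of the same idea on synthetic data. It assumes `make_blobs` output as a stand-in for the TF-IDF matrix, so no download is needed, and it passes an explicit `ax` to both calls. seaborn draws on an ordinary matplotlib Axes, so the centers can be added to the same Axes with `ax.scatter`. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the vectorized texts (assumption: 400 rows, 10 features)
X, _ = make_blobs(n_samples=400, n_features=10, centers=4, random_state=0)

# Fit the PCA once and reuse it for both the points and the centers
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# Cluster in the original feature space, as in the answer above
kmeans = MiniBatchKMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project the centers with the same fitted PCA
centers_2d = pca.transform(kmeans.cluster_centers_)

fig, ax = plt.subplots()
palette = sns.color_palette("bright", len(set(labels)))
sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1], hue=labels,
                legend="full", palette=palette, ax=ax)
# Overlay the projected centers on the same Axes; zorder keeps them on top
ax.scatter(centers_2d[:, 0], centers_2d[:, 1], s=200, c="k", marker="X", zorder=3)
ax.set_title("Clustered")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or save the figure.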
    
