python matplotlib cluster-analysis k-means tf-idf

Cluster using different colours and labels

I am working on text clustering and need to plot the data using a different colour for each cluster. I used k-means for clustering and TF-IDF features to measure similarity.

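import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# X_train is a DataFrame with a 'Sentences' column (see the sample below)
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()

# Project the TF-IDF vectors to 2D for plotting
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

plt.scatter(data2D[:, 0], data2D[:, 1])

# Cluster in the original TF-IDF space, then project the centers to 2D
kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_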

Currently, my output is a plain scatter plot with only a few points, as this is a test. I need to add labels (they are strings) and differentiate the dots by cluster: each cluster should have its own colour so that the chart is easy for the reader to analyse.

Could you please tell me how to change my code to include both labels and colours? Any example would be great.

A sample of my dataset is (the output above was generated from a different sample):

Sentences

Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.

Solution

We can use an example dataset:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc','sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target

pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
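# toarray() returns a plain ndarray; recent scikit-learn versions reject the np.matrix that todense() returns
X = pipeline.fit_transform(X_train).toarray()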

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

Then run KMeans as you did to obtain the clusters and centers, and add a name for each cluster:

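kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
# One name per cluster id, in index order so it lines up with the rows of centers2D
cluster_name = ["Cluster" + str(i) for i in range(kmeans.n_clusters)]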

You can add the colors by passing the cluster labels to "c=" and either choosing a named colormap from cm or defining your own map:

plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
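
For the "define your own map" option, a minimal sketch could look like the following. It assumes three clusters and uses synthetic points in place of data2D and labels; the helper name plot_clusters, the colour choices, and the demo values are illustrative and not part of the original code.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap


def plot_clusters(points, labels, colours):
    """Scatter 2D points with one hand-picked colour per cluster and add a legend."""
    points = np.asarray(points)
    custom_cmap = ListedColormap(colours)  # one colour per cluster id
    fig, ax = plt.subplots()
    sc = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap=custom_cmap, alpha=0.7)
    # legend_elements() returns one handle per distinct value passed to c=
    handles, _ = sc.legend_elements()
    names = ["Cluster" + str(i) for i in range(len(handles))]
    ax.legend(handles, names, title="Clusters")
    return fig, ax


# Synthetic stand-ins for data2D and labels (hypothetical values)
rng = np.random.default_rng(0)
demo_labels = np.repeat([0, 1, 2], 20)
demo_points = rng.normal(size=(60, 2)) + demo_labels[:, None] * 3.0

fig, ax = plot_clusters(demo_points, demo_labels, ["tab:red", "tab:blue", "tab:green"])
```

With the real data, the call would be plot_clusters(data2D, labels, [...]) followed by plt.show().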

You can also consider using seaborn:

# Recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x=data2D[:, 0], y=data2D[:, 1], hue=labels, legend='full', palette="Set1")
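
If the string labels are meant to sit next to each individual point rather than at the cluster centers, one possible approach is ax.annotate. The sketch below is an illustration under assumptions: scatter_with_text, max_chars, and the demo coordinates and cluster ids are made up, and the texts are taken from the sample sentences in the question.

```python
import matplotlib.pyplot as plt
import numpy as np


def scatter_with_text(points, labels, texts, max_chars=20):
    """Scatter 2D points coloured by cluster and write a short text next to each one."""
    points = np.asarray(points)
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="Set1", alpha=0.7)
    for (x, y), txt in zip(points, texts):
        # Truncate long sentences so the labels stay readable
        ax.annotate(txt[:max_chars], (x, y), xytext=(4, 4),
                    textcoords="offset points", fontsize=8)
    return fig, ax


# Made-up coordinates and cluster ids, for illustration only
demo_points = np.array([[0.1, 0.2], [0.4, -0.3], [-0.5, 0.1]])
demo_labels = np.array([0, 1, 2])
demo_texts = [
    "Do you have a list yet?",
    "The first was a list for Howie.",
    "You're not on my list tonight.",
]
fig, ax = scatter_with_text(demo_points, demo_labels, demo_texts)
```

This works best for small samples like the one in the question; with thousands of documents the annotations would overlap, so labelling only the centers is the more readable choice there.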