kmeans scatter plot: plot different colors per cluster

I am trying to make a scatter plot of KMeans output that clusters sentences on the same topic together. The problem I am facing is plotting the points that belong to each cluster in their own color.

sentence_list=["Hi how are you", "Good morning" ...] # I have 10 sentences
km = KMeans(n_clusters=5, init='k-means++',n_init=10, verbose=1)
# with 5 clusters, I want 5 different colors
km.labels_ # [0,1,2,3,3,4,4,5,2,5]

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(sentence_list).todense()
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:,0], data2D[:,1])
centers2D = pca.transform(km.cluster_centers_)
print(labels)

My problem is in the plt.scatter() call at the bottom: what should I pass for the parameter c?

  1. When I use c=labels in the code, I get this error:

number in rbg sequence outside 0-1 range

  2. When I set c=km.labels_ instead, I get this error:

ValueError: Color array must be two-dimensional

plt.scatter(centers2D[:,0], centers2D[:,1], 
            marker='x', s=200, linewidths=3, c=labels)


  • The color= or c= property should be a matplotlib color, as mentioned in the documentation for plot.

    To map an integer label to a color, just do

    LABEL_COLOR_MAP = {0 : 'r',
                       1 : 'k',
                       # ... one entry per cluster label
                       }
    label_color = [LABEL_COLOR_MAP[l] for l in labels]
    plt.scatter(x, y, c=label_color)

    If you don't want to use the builtin one-character color names, you can use other color definitions. See the documentation on matplotlib colors.
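
The label-to-color mapping above can be sketched as a small, self-contained helper. This is only an illustration under a few assumptions: the names `DEFAULT_COLORS`, `labels_to_colors`, and `plot_clusters` are hypothetical, the five colors are arbitrary, and the centers are assumed to be ordered by label (center i belongs to cluster i, as with scikit-learn's `cluster_centers_`). Note that the centers get their own color list, one entry per center, rather than the per-point labels.

```python
# Hypothetical palette: one matplotlib color per cluster label (0..4).
DEFAULT_COLORS = ['r', 'k', 'b', 'g', 'm']


def labels_to_colors(labels, colors=DEFAULT_COLORS):
    """Return one color per point by looking up its integer cluster label."""
    label_color_map = dict(enumerate(colors))
    # int() lets NumPy integer labels (e.g. km.labels_) be used as dict keys;
    # a label with no entry in the map raises KeyError.
    return [label_color_map[int(label)] for label in labels]


def plot_clusters(points, labels, centers=None):
    """Scatter 2-D points colored by cluster; optionally mark the centers."""
    # Imported here so the color helper above works without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    ax.scatter(xs, ys, c=labels_to_colors(labels))

    if centers is not None:
        # Assumption: center i belongs to cluster i, so it gets color i.
        cx = [c[0] for c in centers]
        cy = [c[1] for c in centers]
        ax.scatter(cx, cy, marker='x', s=200, linewidths=3,
                   c=labels_to_colors(range(len(centers))))
    return fig
```

With the variables from the question, a call might look like `plot_clusters(data2D, km.labels_, centers2D)` followed by `plt.show()`. Any other matplotlib color definition works in the palette too, for example hex strings such as `'#1f77b4'`.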