python pandas cluster-analysis k-means

Visualise clusters with k-means


I have the following dataset:

    Date    Text
0   05/26/2020  è morto all'improvviso jk, aveva...
1   05/26/2020  è morto a 51 anni jk, attore, co...
2   05/26/2020  aveva 51 anni e si trovava in Italia. il rico...
3   05/26/2020  arriva a milano nel 1990 per una serie di conc...
4   05/26/2020  jk, l'attore e comico, e...
5   05/26/2020  spettacolo.it ha appreso che jk, l'...
6   05/26/2020  e' morto all'improvviso jk. cant...
7   05/26/2020  addio a jk . una morte improvvis...
8   05/26/2020  lutto nel mondo della televisione. è morto a 5...
9   05/26/2020  è morto all'età di 51 anni ...
10  05/26/2020  è morto all'età di 51 anni ...
11  05/26/2020  all'improvviso se ne è andato  ...
12  05/26/2020  è andato al supermercato  ...
13  05/26/2020  jk è morto improvvisamente a 51 ...
14  05/26/2020  è morto, a menfi, il 51enne jk...
15  05/26/2020  muore a cinquantuno anni jk, il ...

I would like to use clustering (k-means) to create labels for classifying the texts. This is what I did:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords

stop_words = stopwords.words('italian')

def preprocessing(line):
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed


vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['Text'])
kmeans = KMeans(n_clusters=2).fit(vectorized_text)

Then

import pandas as pd

# Assign each text to its cluster and count the texts per cluster
cl = kmeans.predict(vectorized_text)
df['Cluster'] = pd.Series(cl, index=df.index)
df.groupby("Cluster").count()

I would like to know how to visualise the results. I have tried as follows:

import matplotlib.pyplot as plt

plt.scatter(vectorized_text, cl)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

but I have this error:

ValueError: x and y must be the same size

The error comes from plt.scatter(vectorized_text, cl), so something is wrong there. Looking for possible solutions on the web, I found suggestions to use PCA. Should I consider it?

Thank you

UPDATE: After receiving the answer below, I tried:

plt.scatter(vectorized_text[:, 0], cl)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()

Unfortunately, I am still getting the same error:

ValueError: x and y must be the same size


Solution

  • The x argument of plt.scatter() must have the same number of elements as y, i.e. shape (n,), which is not the case here. You can only pass one column of vectorized_text to the scatter plot, not all of them. Right now your x has shape (209, 1245), while your y has shape (209,).

    How do you transform vectorized_text into a 1D array?

    Spoiler: you cannot, at least not directly! You first need to slice out one column, then convert it to a dense matrix (right now it is a sparse matrix), and then cast it to an array.

    Let's assume you want to plot the first column of vectorized_text. What you need to pass as x to plt.scatter is:

    np.asarray(vectorized_text[:, 0].todense())
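
    To make the shapes concrete, here is a minimal sketch of the slice, densify, cast steps on a toy sparse matrix. The matrix X and its values are made up for illustration and only stand in for vectorized_text; the variable names are assumptions, not part of the original code.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for vectorized_text: 4 documents, 3 features (made-up values).
X = csr_matrix(np.array([[0.0, 0.5, 0.0],
                         [0.7, 0.0, 0.0],
                         [0.0, 0.0, 0.9],
                         [0.2, 0.0, 0.0]]))

col = X[:, 0]                     # still a sparse matrix, shape (4, 1)
x_2d = np.asarray(col.todense())  # dense ndarray, shape (4, 1)
x_1d = col.toarray().ravel()      # flat ndarray, shape (4,)

print(col.shape, x_2d.shape, x_1d.shape)
```

    With the real data, something like plt.scatter(vectorized_text[:, 0].toarray().ravel(), cl) should then receive x and y of the same size. Keep in mind that this only plots a single TF-IDF feature against the cluster label; for a 2D picture of the clusters, a projection such as the PCA mentioned in the question (or sklearn's TruncatedSVD, which accepts sparse input) is probably worth considering.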