I have a text corpus that contains 1000+ articles each in a separate line. I used Hierarchy Clustering using Sklearn in python to produce clusters of related articles. This is the code I used to do the clustering
Note: X, is a sparse NumPy 2D array with rows corresponding to documents and columns corresponding to terms
# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(affinity="euclidean",linkage="complete",n_clusters=3)
model.fit(X.toarray())
clustering = model.labels_
print (clustering)
I specify the number of clusters = 3 at which to cut off the tree to get a flat clustering like K-mean
My question is : How to get the top N frequent words in each cluster? so that I can suggest a topic for each cluster. Thanks
One option is to convert X from the sparse numpy array to a pandas dataframe. The rows will still correspond to documents, and the columns to words. If you have a list of your vocabulary in order of your array columns (used as your_word_list
below) you could try something like this:
import pandas as pd
X = pd.DataFrame(X.toarray(), columns=your_word_list) # columns argument is optional
X['Cluster'] = clustering # Add column corresponding to cluster number
word_frequencies_by_cluster = X.groupby('Cluster').sum()
# To get sorted list for a numbered cluster, in this case 1
print word_frequencies_by_cluster.loc[1, :].sort(ascending=False)
As a side note, you may want to look into algorithms (e.g. LDA) and distance metrics (cosine) that are more commonly used for natural language processing. If you are looking to extract topics, there is a nice sklearn tutorial on topic modeling.