python scikit-learn data-mining hierarchical-clustering sklearn-pandas

How to get the top N frequent words in each cluster? Sklearn

I have a text corpus of 1000+ articles, one per line. I used hierarchical (agglomerative) clustering from scikit-learn in Python to produce clusters of related articles. This is the code I used for the clustering:

Note: X is a sparse 2D matrix (it supports .toarray()) with rows corresponding to documents and columns corresponding to terms.

# Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
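# Note: in scikit-learn >= 1.2 the `affinity` parameter is named `metric`
model = AgglomerativeClustering(affinity="euclidean", linkage="complete", n_clusters=3)
model.fit(X.toarray())  # convert the sparse matrix to a dense array before fitting
clustering = model.labels_  # one cluster label per document
print(clustering)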

I set n_clusters=3 to cut the tree at three clusters and get a flat clustering, as K-means would produce.

My question is: how do I get the top N most frequent words in each cluster, so that I can suggest a topic for each cluster? Thanks.

Solution

One option is to convert X from a sparse matrix to a pandas DataFrame. The rows will still correspond to documents, and the columns to words. If you have a list of your vocabulary in the same order as the matrix columns (called your_word_list below), you could try something like this:

import pandas as pd

X = pd.DataFrame(X.toarray(), columns=your_word_list)  # columns argument is optional
X['Cluster'] = clustering  # Add column corresponding to cluster number
word_frequencies_by_cluster = X.groupby('Cluster').sum()

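# To get a sorted list for a numbered cluster, in this case 1.
# Series.sort() was removed from pandas; use sort_values(), and append .head(N) to keep only the top N words.
print(word_frequencies_by_cluster.loc[1, :].sort_values(ascending=False))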
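
If the vocabulary is large, densifying the whole matrix just to count words can be costly. As a rough sketch of the same idea without pandas, the term counts can be summed per cluster directly on the sparse matrix. The helper name top_words_per_cluster and the toy data below are made up for illustration; they are not part of scikit-learn or of the question's code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def top_words_per_cluster(X, labels, vocabulary, n=10):
    """Return {cluster_label: [(word, total), ...]} with the n most frequent words.

    Hypothetical helper: X is a documents-by-terms matrix (sparse or dense),
    labels holds one cluster label per document, and vocabulary lists the
    terms in column order.
    """
    labels = np.asarray(labels)
    vocabulary = np.asarray(vocabulary)
    result = {}
    for label in np.unique(labels):
        # Sum the term counts over all documents assigned to this cluster.
        totals = np.asarray(X[labels == label].sum(axis=0)).ravel()
        # Column indices of the n largest totals, highest first.
        top = np.argsort(totals)[::-1][:n]
        result[int(label)] = [(str(vocabulary[i]), totals[i].item()) for i in top]
    return result

# Toy data (illustrative only): 4 documents, 4 terms, 2 clusters.
vocab = ["cat", "dog", "stock", "market"]
X_demo = csr_matrix(np.array([
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
    [0, 1, 2, 3],
]))
labels_demo = [0, 0, 1, 1]

top_words = top_words_per_cluster(X_demo, labels_demo, vocab, n=2)
print(top_words)
```

With the question's variables, the call would look like top_words_per_cluster(X, clustering, your_word_list, n=10). If X holds TF-IDF weights rather than raw counts, the totals are summed weights, not frequencies.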

As a side note, you may want to look into algorithms (e.g. latent Dirichlet allocation, LDA) and distance metrics (e.g. cosine distance) that are more commonly used for natural language processing. If you are looking to extract topics, there is a nice sklearn tutorial on topic modeling.