Tags: python, scikit-learn, k-means, text-mining, tfidfvectorizer

Scikit Learn K-means Clustering & TfidfVectorizer: How to pass top n terms with highest tf-idf score to k-means


I am clustering text data using the output of TfidfVectorizer. The code works fine: it passes the entire TF-IDF matrix to K-Means and generates a scatter plot. Instead, I would like to pass only the top n terms, ranked by TF-IDF score, to K-Means. Is there a way to achieve that?

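import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')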

tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])


# Create the k-means model with a custom configuration.
# precompute_distances and n_jobs were removed from KMeans in scikit-learn 1.0,
# so they are omitted here.
clustering_model = KMeans(
    n_clusters=num_clusters,
    max_iter=max_iterations,
)

labels = clustering_model.fit_predict(tfidf_matrix)

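# PCA expects a dense ndarray; todense() returns np.matrix, which recent
# scikit-learn versions reject.
x = tfidf_matrix.toarray()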

reduced_data = PCA(n_components=pca_num_components).fit_transform(x)


fig, ax = plt.subplots()
for index, (pca_comp_1, pca_comp_2) in enumerate(reduced_data):
    color = labels_color_map[labels[index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
plt.show()

Solution

  • Use max_features in TfidfVectorizer to keep only the top n features:

    vect = TfidfVectorizer(ngram_range=(1,3),stop_words='english', max_features=n)
    

    According to the scikit-learn documentation, max_features accepts an int or None (default=None). If it is not None, TfidfVectorizer builds a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus. Note that this ranking uses raw term frequency, not the TF-IDF score.
