I am clustering text data using a TF-IDF vectorizer. The code works fine: it takes the entire TF-IDF vectorizer output as input to K-Means clustering and generates a scatter plot. Instead, I would like to send only the top n terms (by TF-IDF score) as input to the k-means clustering. Is there a way to achieve that?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')
tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])

# create k-means model with custom config
clustering_model = KMeans(
    n_clusters=num_clusters,
    max_iter=max_iterations,
    precompute_distances="auto",
    n_jobs=-1
)
labels = clustering_model.fit_predict(tfidf_matrix)

# reduce the TF-IDF matrix to 2D for plotting
x = tfidf_matrix.toarray()
reduced_data = PCA(n_components=pca_num_components).fit_transform(x)

fig, ax = plt.subplots()
for index, instance in enumerate(reduced_data):
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels[index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
plt.show()
Use max_features in TfidfVectorizer to keep only the top n features:
vect = TfidfVectorizer(ngram_range=(1,3),stop_words='english', max_features=n)
According to scikit-learn's documentation, max_features takes an int or None (default=None). If not None, TfidfVectorizer builds a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
Here is the link: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
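For illustration, here is a minimal, self-contained sketch of how this plugs into your pipeline. The docs list, n, and num_clusters below are placeholder values standing in for df_doc_wholetext['csv_text'] and your own settings:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# toy corpus standing in for df_doc_wholetext['csv_text']
docs = [
    "machine learning with text data",
    "clustering text documents with tfidf",
    "k-means clustering of tfidf vectors",
    "scatter plots of reduced text features",
]

n = 20            # placeholder: number of top terms to keep
num_clusters = 2  # placeholder

# max_features=n limits the vocabulary to the top n terms
# (ordered by term frequency across the corpus), so the matrix
# passed to KMeans has at most n columns
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english', max_features=n)
tfidf_matrix = vect.fit_transform(docs)

labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit_predict(tfidf_matrix)

# the retained terms (use get_feature_names() on older scikit-learn versions)
print(vect.get_feature_names_out())
print(labels)

Note that max_features ranks terms by raw term frequency across the corpus, not by TF-IDF score. The rest of your PCA/scatter-plot code can stay exactly as it is, since it only sees the narrower tfidf_matrix.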