I have converted a whole heap of PDF documents to text and compiled them into a dict. I know for a fact that I have 3 distinct document types, and I want to use clustering to group them automatically:
dict_of_docs = {'document_1':'contents of document', 'document_2':'contents of document', 'document_3':'contents of document',...'document_100':'contents of document'}
Then, I vectorised the values of my dictionary:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
My output of X is something like this:
(0, 768) 0.05895270500636258
(0, 121) 0.11790541001272516
(0, 1080) 0.05895270500636258
(0, 87) 0.2114378682212116
(0, 1458) 0.1195944498355368
(0, 683) 0.0797296332236912
(0, 1321) 0.12603709835806634
(0, 630) 0.12603709835806634
(0, 49) 0.12603709835806634
(0, 750) 0.12603709835806634
(0, 1749) 0.10626171032944469
(0, 478) 0.12603709835806634
(0, 1632) 0.14983692373373858
(0, 177) 0.12603709835806634
(0, 653) 0.0497440271723707
(0, 1268) 0.13342186854440274
(0, 1489) 0.07052056544031632
(0, 72) 0.12603709835806634
...etc etc
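Each line above is a (document_index, term_index) pair followed by that term's TF-IDF weight. To see which word a given column index refers to, I can look it up through the vectorizer, e.g. (get_feature_names_out is the current scikit-learn name; older versions use get_feature_names):

feature_names = vectorizer.get_feature_names_out()
print(feature_names[768])  # the term behind column 768 in the first line above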
Then I converted it to an array with X = X.toarray().
I am now at the stage of trying to scatter plot the clusters of my real data with matplotlib. From there I want to use what I learn from the clustering to sort the documents. All the guides I have followed use made-up data arrays, but they don't show how to go from real-world data to something that can be used the way they have demonstrated.
How do I get my array of vectorised data into a scatter plot?
In a few steps: clustering, dimensionality reduction, plotting, and debugging.
First, we use K-Means to fit X (our TF-IDF-vectorised dataset).
from sklearn.cluster import KMeans

NUMBER_OF_CLUSTERS = 3
km = KMeans(
    n_clusters=NUMBER_OF_CLUSTERS,
    init='k-means++',
    max_iter=500)
km.fit(X)
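Note that K-Means is sensitive to its random initialisation, so passing random_state=<some int> to KMeans makes the run reproducible. As a quick sanity check before plotting, you can count how many documents landed in each cluster:

import numpy as np

# number of documents assigned to each of the clusters
print(np.bincount(km.labels_))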
from sklearn.decomposition import PCA

# First: for every document we get its corresponding cluster label
clusters = km.predict(X)

# We train the PCA on a dense version of the tf-idf matrix.
# (If you already converted with X = X.toarray(), pass X in directly instead.)
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X.todense())

scatter_x = two_dim[:, 0]  # first principal component
scatter_y = two_dim[:, 1]  # second principal component
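A side note: PCA needs a dense matrix, which is why .todense() is called, and densifying a large corpus can be memory-hungry. As an alternative, scikit-learn's TruncatedSVD performs the same kind of 2-D reduction directly on the sparse tf-idf matrix; a minimal drop-in replacement for the two PCA lines above:

from sklearn.decomposition import TruncatedSVD

# reduce the sparse tf-idf matrix to 2 dimensions without densifying it
svd = TruncatedSVD(n_components=2)
two_dim = svd.fit_transform(X)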
We plot every cluster with a pre-assigned colour.
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

fig, ax = plt.subplots()
fig.set_size_inches(20, 10)

# colour map for the NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}

# group by cluster and scatter plot every cluster
# with its own colour and label
for group in np.unique(clusters):
    ix = np.where(clusters == group)
    ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)

ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
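To actually sort the documents into their groups, which was the end goal in the question, you can zip the cluster labels back onto the dictionary keys. A minimal sketch, assuming dict_of_docs is the same dict whose values were passed to fit_transform (Python 3.7+ dicts preserve insertion order, so the keys line up with the rows of X):

# map each document name to the cluster it was assigned to
docs_by_cluster = {}
for name, label in zip(dict_of_docs.keys(), clusters):
    docs_by_cluster.setdefault(label, []).append(name)

print(docs_by_cluster[0])  # names of the documents in cluster 0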
Finally, print the top 10 words in every cluster to see what each one is about.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() on scikit-learn >= 1.0
for i in range(NUMBER_OF_CLUSTERS):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
# Cluster 0: com edu medical yeast know cancer does doctor subject lines
# Cluster 1: edu game games team baseball com year don pitcher writes
# Cluster 2: edu car com subject organization lines university writes article
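Since the question states there are exactly 3 document types, it is also worth checking that the clustering agrees with that. One quick check is scikit-learn's silhouette score (between -1 and 1; higher means better-separated clusters):

from sklearn.metrics import silhouette_score

# how well separated the clusters are; values near 1 are good
print(silhouette_score(X, km.labels_))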