I need to connect the word '4G' with 'mobile phone' or 'Internet' so that sentences about technology are clustered together.
I have the following sentences:
4G is the fourth generation of broadband network.
4G is slow.
4G is defined as the fourth generation of mobile technology
I bought a new mobile phone.
I need the above sentences to end up in the same cluster. Currently they do not, probably because no relation is found between '4G' and 'mobile'.
I first tried wordnet.synsets to find synonyms connecting '4G' to 'Internet' or 'mobile phone', but unfortunately it did not find any connection.
To cluster the sentences I am doing as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy
texts = ["4G is the fourth generation of broadband network.",
"4G is slow.",
"4G is defined as the fourth generation of mobile technology",
"I bought a new mobile phone."]
# vectorization of the sentences
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
words = vectorizer.get_feature_names_out()
print("words", words)
n_clusters=3
number_of_seeds_to_try=10
max_iter = 300
model = KMeans(n_clusters=n_clusters, max_iter=max_iter, n_init=number_of_seeds_to_try).fit(X)
labels = model.labels_
# indices of the top-weighted words in each cluster
ordered_words = model.cluster_centers_.argsort()[:, ::-1]
print("centers:", model.cluster_centers_)
print("labels", labels)
print("inertia:", model.inertia_)
texts_per_cluster = numpy.zeros(n_clusters)
for i_cluster in range(n_clusters):
    for label in labels:
        if label == i_cluster:
            texts_per_cluster[i_cluster] += 1
print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster]))
    for term in ordered_words[i_cluster, :10]:
        print("\t" + words[term])
    print("\n")
print("Prediction")
text_to_predict = "Why 5G is dangerous?"
Y = vectorizer.transform([text_to_predict])
predicted_cluster = model.predict(Y)[0]
texts_per_cluster[predicted_cluster]+=1
print(text_to_predict)
print("Cluster:", predicted_cluster, "texts:", int(texts_per_cluster[predicted_cluster]))
for term in ordered_words[predicted_cluster, :10]:
    print("\t" + words[term])
Any help on this would be greatly appreciated.
As @sergey-bushmanov's comment notes, dense word embeddings (from word2vec or similar algorithms) may help.
They convert words to dense high-dimensional vectors, where words with similar meanings/usages are close to each other. Even more: certain directions in that space are often roughly associated with particular kinds of relationships between words.
So, word-vectors trained on sufficiently-representative (large and varied) text will place the vectors for '4G'
and 'mobile'
somewhat near each other, and then if your sentence-representations are bootstrapped from word-vectors, that may help your clustering.
One quick way to use individual word-vectors to model sentences is to use the average of all a sentence's word-vectors as the sentence vector. That's too simple to model many shades of meaning (especially those that come from grammar and word-order), but often works as a good baseline, especially for matters of broad topicality.
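A sketch of that averaging (the `word_vectors` dict here is a random stand-in for a real trained model, for illustration only; out-of-vocabulary words are simply skipped):

```python
import numpy as np

def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of all in-vocabulary words; zero vector if none."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy stand-in for real trained vectors (random, for illustration only).
rng = np.random.default_rng(0)
vocab = ["4g", "mobile", "phone", "broadband", "network", "slow"]
word_vectors = {w: rng.normal(size=8) for w in vocab}

v = sentence_vector("4G is slow", word_vectors, dim=8)
print(v.shape)  # (8,)
```

These sentence vectors can then be fed to KMeans in place of the TF-IDF matrix.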
Another calculation, "Word Mover's Distance", treats sentences as sets-of-word-vectors (without averaging them), and can do sentence-to-sentence distance calculations that work better than simple averages – but become very expensive to calculate for longer sentences.
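gensim exposes a full WMD implementation (`KeyedVectors.wmdistance`, which needs an optimal-transport package installed). A simplified "relaxed" variant, a known cheap lower bound where each word in one sentence just travels to its nearest word in the other, can be sketched in plain NumPy (again with random stand-in vectors):

```python
import numpy as np

def relaxed_wmd(tokens_a, tokens_b, word_vectors):
    """Relaxed Word Mover's Distance: each word in A 'travels' to its
    nearest word in B.  A cheap lower bound on the true WMD, which
    instead solves a full optimal-transport problem over the two sets."""
    A = np.array([word_vectors[w] for w in tokens_a if w in word_vectors])
    B = np.array([word_vectors[w] for w in tokens_b if w in word_vectors])
    # Pairwise Euclidean distances between the two sets of word vectors.
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

rng = np.random.default_rng(0)
vocab = ["4g", "mobile", "phone", "slow", "broadband"]
word_vectors = {w: rng.normal(size=8) for w in vocab}

d = relaxed_wmd(["4g", "slow"], ["mobile", "phone"], word_vectors)
print(d)  # non-negative; 0 only when every word of A also appears in B
```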