python, scikit-learn, nlp, k-means, word2vec

How to find a 'connection' between words for clustering sentences


I need to connect the word 4G with mobile phone or Internet in order to cluster sentences about technology together. I have the following sentences:

4G is the fourth generation of broadband network.
4G is slow. 
4G is defined as the fourth generation of mobile technology
I bought a new mobile phone. 

I need the above sentences to end up in the same cluster. Currently they do not, probably because the clusterer finds no relation between 4G and mobile. I first tried wordnet.synsets to find synonyms that would connect 4G to Internet or mobile phone, but unfortunately it did not find any connection. To cluster the sentences I do the following:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy

texts = ["4G is the fourth generation of broadband network.",
    "4G is slow.",
    "4G is defined as the fourth generation of mobile technology",
    "I bought a new mobile phone."]

# vectorization of the sentences
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
words = vectorizer.get_feature_names()
print("words", words)


n_clusters=3
number_of_seeds_to_try=10
max_iter = 300
number_of_process=2 # seeds are distributed across processes
model = KMeans(n_clusters=n_clusters, max_iter=max_iter, n_init=number_of_seeds_to_try, n_jobs=number_of_process).fit(X)

labels = model.labels_
# indices of the top words in each cluster
ordered_words = model.cluster_centers_.argsort()[:, ::-1]

print("centers:", model.cluster_centers_)
print("labels", labels)
print("intertia:", model.inertia_)

texts_per_cluster = numpy.zeros(n_clusters)
for i_cluster in range(n_clusters):
    for label in labels:
        if label==i_cluster:
            texts_per_cluster[i_cluster] +=1 

print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster])),
    for term in ordered_words[i_cluster, :10]:
        print("\t"+words[term])

print("\n")
print("Prediction")

text_to_predict = "Why 5G is dangerous?"
Y = vectorizer.transform([text_to_predict])
predicted_cluster = model.predict(Y)[0]
texts_per_cluster[predicted_cluster]+=1

print(text_to_predict)
print("Cluster:", predicted_cluster, "texts:", int(texts_per_cluster[predicted_cluster])),
for term in ordered_words[predicted_cluster, :10]:
print("\t"+words[term])

Any help on this would be greatly appreciated.


Solution

  • As @sergey-bushmanov's comment notes, dense word embeddings (as from word2vec or similar algorithms) may help.

    They convert words to dense high-dimensional vectors in which words with similar meanings/usages are close to each other. Moreover, certain directions in that space often correspond roughly to particular kinds of relationships between words.

    So, word-vectors trained on sufficiently-representative (large and varied) text will place the vectors for '4G' and 'mobile' somewhat near each other, and then if your sentence-representations are bootstrapped from word-vectors, that may help your clustering.

    One quick way to use individual word-vectors to model sentences is to take the average of all of a sentence's word-vectors as the sentence vector. That's too simple to capture many shades of meaning (especially those that come from grammar and word-order), but it often works as a good baseline, especially for matters of broad topicality; a sketch of this approach follows at the end of this answer.

    Another calculation, "Word Mover's Distance", treats sentences as sets of word-vectors (without averaging them) and can compute sentence-to-sentence distances that work better than simple averages, but it becomes very expensive to calculate for longer sentences; a short usage sketch also follows below.
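
    A minimal sketch of that averaging baseline, assuming the downloadable "glove-wiki-gigaword-100" vectors from gensim (any word2vec-style KeyedVectors would work) and a deliberately crude whitespace tokenizer; the model name and tokenization are illustrative choices, not part of the original code:

    # Sketch: average word-vectors per sentence, then cluster with KMeans.
    import numpy as np
    import gensim.downloader as api
    from sklearn.cluster import KMeans

    texts = ["4G is the fourth generation of broadband network.",
             "4G is slow.",
             "4G is defined as the fourth generation of mobile technology",
             "I bought a new mobile phone."]

    kv = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

    def sentence_vector(text):
        # crude tokenization, then average the vectors of in-vocabulary tokens
        tokens = [t.lower().strip(".,?") for t in text.split()]
        vectors = [kv[t] for t in tokens if t in kv]
        if not vectors:                        # no known words at all
            return np.zeros(kv.vector_size)
        return np.mean(vectors, axis=0)

    X = np.vstack([sentence_vector(t) for t in texts])
    model = KMeans(n_clusters=2, n_init=10).fit(X)
    print("labels:", model.labels_)

    With dense sentence vectors like these, sentences sharing topical words such as '4G', 'mobile', and 'phone' tend to land closer together than they do with TF-IDF, which treats every distinct token as an unrelated, orthogonal dimension.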
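
    And a short usage sketch of Word Mover's Distance via gensim's wmdistance(), again assuming the same downloadable GloVe vectors; note that it needs an optimal-transport backend installed (pyemd for older gensim releases, POT for newer ones):

    # Word Mover's Distance between tokenized sentences; lower means closer.
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-100")

    s1 = "4g is defined as the fourth generation of mobile technology".split()
    s2 = "i bought a new mobile phone".split()
    s3 = "the stock market fell sharply today".split()

    print(kv.wmdistance(s1, s2))  # expect a smaller distance (related topics)
    print(kv.wmdistance(s1, s3))  # expect a larger distance (unrelated topics)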