Tags: python, scikit-learn, nlp, k-means, word2vec

KMeans clustering multidimensional features


Is it possible to train a KMeans model using a multidimensional feature matrix?

I'm using sklearn and its KMeans class for clustering, Word2Vec for extracting the bag of words, and TreeTagger for the text pre-processing:

from gensim.models import Word2Vec
from sklearn.cluster import KMeans

lemmatized_words = [["be", "information", "contract", "residential"], ["can", "send", "package", "recovery"]

w2v_model = Word2Vec.load(wiki_path_model)

bag_of_words = [w2v_model.wv[phrase] for phrase in lemmatized_words]

# bag_of_words = [array([[-0.08796783,  0.08373307,  0.04610106, ...,  0.41964772,
#         -0.1733183 ,  0.09438939],
#        [ 0.11526374,  0.09092105, -0.2086806 , ...,  0.5205145 ,
#         -0.11455593, -0.05190944],
#        [-0.05140354,  0.09938619,  0.07485678, ...,  0.73840886,
#         -0.17298238,  0.09994634],
#        ...,
#        [-0.01144416, -0.17129216, -0.04012141, ...,  0.05281362,
#         -0.23109615,  0.02297313],
#        [-0.08355679,  0.24799444,  0.04348441, ...,  0.27940673,
#         -0.14400786, -0.09187686],
#        [ 0.11022831,  0.11035886,  0.19900796, ...,  0.12891224,
#         -0.09379898,  0.10538024]], dtype=float32),
#  array([[ 1.73330009e-01,  1.26429915e-01, -3.47578406e-01, ...,
#          8.09064806e-02, -3.02738965e-01, -1.61911864e-02],
#        [ 2.47227158e-02, -6.48087710e-02, -1.97364464e-01, ...,
#          1.35158226e-01,  1.72204189e-02, -1.14456110e-01],
#        [ 8.07424933e-02,  2.69261692e-02, -4.22120057e-02, ...,
#          1.01349883e-01, -1.94084793e-01, -2.64464412e-04],
#        ...,
#        [ 1.36009008e-01,  1.50609210e-01, -2.59797573e-01, ...,
#          1.84113771e-01, -6.85161874e-02, -1.04138054e-01],
#        [ 4.83367145e-02,  1.17820159e-01, -2.43335906e-02, ...,
#          1.33836940e-01, -1.55749675e-02, -1.18981823e-01],
#        [-6.68482706e-02,  4.57039356e-01, -2.20365867e-01, ...,
#          2.95841128e-01, -1.55933857e-01,  7.39804050e-03]], dtype=float32)]

model = KMeans(algorithm='auto', max_iter=300, n_clusters=2)

model.fit(bag_of_words)

I expect KMeans to be trained, so that I can store the model and use it for predictions, but instead I receive this error message:

ValueError: setting an array element with a sequence.
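
The same error can be reproduced with plain NumPy (a minimal sketch, assuming the real corpus contains phrases with different word counts; the exact message wording can vary between NumPy versions):

import numpy as np

# Two phrases of different lengths -> per-phrase matrices with different shapes.
ragged = [np.zeros((4, 300), dtype=np.float32),
          np.zeros((3, 300), dtype=np.float32)]

# KMeans.fit tries to coerce its input into a single numeric 2-D array, which fails:
np.asarray(ragged, dtype=np.float32)
# ValueError: setting an array element with a sequence.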

Solution

  • Your problem is in w2v_model.wv[phrase]. A Word2vec model, as its name implies, works at the word level. To obtain phrase embeddings, you need to average (or aggregate in some other way) the embeddings of all individual words in the phrase.

    So you need to replace

    bag_of_words = [w2v_model.wv[phrase] for phrase in lemmatized_words]
    

    with

    import numpy as np
    bag_of_words = [np.mean([w2v_model.wv[word] for word in phrase], axis=0) for phrase in lemmatized_words]
    

    For me, the following code snippet worked fine. It uses KeyedVectors.load_word2vec_format instead of the deprecated Word2Vec.load_word2vec_format, but all the rest is the same.

    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans
    import numpy as np
    lemmatized_words = [["be", "information", "contract", "residential"], ["can", "send", "package", "recovery"]]
    w2v_model = KeyedVectors.load_word2vec_format(wiki_path_model, binary=True)  
    bag_of_words = np.array([np.mean([w2v_model[word] for word in phrase if word in w2v_model], axis=0) for phrase in lemmatized_words])
    print(bag_of_words.shape) # it should give (2, 300) for a 300-dimensional w2v
    model = KMeans(max_iter=300, n_clusters=2)
    model.fit(bag_of_words)
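
    Since you mention storing the model and using it for predictions: once fit() succeeds, the fitted KMeans can be persisted and reused in the usual scikit-learn way. A minimal sketch (joblib and the file name are illustrative choices, not something prescribed by your setup):

    import joblib
    joblib.dump(model, "kmeans_phrases.joblib")    # hypothetical file name
    restored = joblib.load("kmeans_phrases.joblib")
    # Embed a new phrase the same way (mean of its word vectors) before predicting.
    new_phrase = ["send", "contract", "information"]
    new_vec = np.mean([w2v_model[w] for w in new_phrase if w in w2v_model], axis=0)
    print(restored.predict(new_vec.reshape(1, -1)))  # cluster index, e.g. [1]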
    

    Of course, averaging (or any other aggregation) discards some information about the individual words, and that information might be meaningful for clustering. But without aggregation you cannot get comparable phrase embeddings, because different phrases may have different lengths. If plain averaging underperforms, another simple aggregation is sketched below.
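
    For example, element-wise max pooling keeps the strongest value per dimension instead of the mean (a sketch under the same assumptions as the snippet above; whether it clusters better depends on your data):

    bag_of_words_max = np.array([
        np.max([w2v_model[word] for word in phrase if word in w2v_model], axis=0)
        for phrase in lemmatized_words
    ])
    print(bag_of_words_max.shape)  # still (2, 300) for a 300-dimensional w2v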

    If your clustering of the averaged embeddings fails, I would recommend looking into pretrained sentence embeddings (e.g. Google's Universal Sentence Encoder, or embeddings from BERT); a sketch of that route follows.
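
    As an illustration of that route, a minimal sketch using the sentence-transformers library (my choice of library, and "all-MiniLM-L6-v2" is just one common model; neither comes from your setup):

    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
    from sklearn.cluster import KMeans

    sentences = ["be information contract residential", "can send package recovery"]
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(sentences)  # one fixed-size vector per sentence
    model = KMeans(n_clusters=2, max_iter=300).fit(embeddings)
    print(model.labels_)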