Is it possible to train a KMeans ML model using a multidimensional feature matrix?
I'm using sklearn's KMeans class for clustering, Word2Vec for extracting the bag of words, and TreeTagger for the text pre-processing:
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
lemmatized_words = [["be", "information", "contract", "residential"], ["can", "send", "package", "recovery"]]
w2v_model = Word2Vec.load(wiki_path_model)
bag_of_words = [w2v_model.wv(phrase) for phrase in lemmatized_words]
#
#
# bag_of_words = [array([[-0.08796783, 0.08373307, 0.04610106, ..., 0.41964772,
# -0.1733183 , 0.09438939],
# [ 0.11526374, 0.09092105, -0.2086806 , ..., 0.5205145 ,
# -0.11455593, -0.05190944],
# [-0.05140354, 0.09938619, 0.07485678, ..., 0.73840886,
# -0.17298238, 0.09994634],
# ...,
# [-0.01144416, -0.17129216, -0.04012141, ..., 0.05281362,
# -0.23109615, 0.02297313],
# [-0.08355679, 0.24799444, 0.04348441, ..., 0.27940673,
# -0.14400786, -0.09187686],
# [ 0.11022831, 0.11035886, 0.19900796, ..., 0.12891224,
# -0.09379898, 0.10538024]], dtype=float32),
# array([[ 1.73330009e-01, 1.26429915e-01, -3.47578406e-01, ...,
# 8.09064806e-02, -3.02738965e-01, -1.61911864e-02],
# [ 2.47227158e-02, -6.48087710e-02, -1.97364464e-01, ...,
# 1.35158226e-01, 1.72204189e-02, -1.14456110e-01],
# [ 8.07424933e-02, 2.69261692e-02, -4.22120057e-02, ...,
# 1.01349883e-01, -1.94084793e-01, -2.64464412e-04],
# ...,
# [ 1.36009008e-01, 1.50609210e-01, -2.59797573e-01, ...,
# 1.84113771e-01, -6.85161874e-02, -1.04138054e-01],
# [ 4.83367145e-02, 1.17820159e-01, -2.43335906e-02, ...,
# 1.33836940e-01, -1.55749675e-02, -1.18981823e-01],
# [-6.68482706e-02, 4.57039356e-01, -2.20365867e-01, ...,
# 2.95841128e-01, -1.55933857e-01, 7.39804050e-03]], dtype=float32)
# ]
#
#
model = KMeans(algorithm='auto', max_iter=300, n_clusters=2)
model.fit(bag_of_words)
I expect KMeans to be trained so that I can store the model and use it for predictions, but instead I receive this error message:
ValueError: setting an array element with a sequence.
Your problem is in w2v_model.wv(phrase). A Word2vec model, as its name implies, works at the word level. To obtain phrase embeddings, you need to average (or aggregate in some other way) the embeddings of all individual words in the phrase.
So you need to replace
bag_of_words = [w2v_model.wv(phrase) for phrase in lemmatized_words]
with
import numpy as np
bag_of_words = [np.mean([w2v_model.wv[word] for word in phrase], axis=0) for phrase in lemmatized_words]
(note the bracket indexing: word vectors are looked up with wv[word], not called like a function).
For me, the following code snippet worked OK. It uses KeyedVectors.load_word2vec_format instead of the deprecated Word2Vec.load_word2vec_format, but all the rest is the same.
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans
import numpy as np
lemmatized_words = [["be", "information", "contract", "residential"], ["can", "send", "package", "recovery"]]
w2v_model = KeyedVectors.load_word2vec_format(wiki_path_model, binary=True)
# average the word vectors of each phrase, skipping out-of-vocabulary words
bag_of_words = np.array([
    np.mean([w2v_model[word] for word in phrase if word in w2v_model], axis=0)
    for phrase in lemmatized_words
])
print(bag_of_words.shape)  # should give (2, 300) for a 300-dimensional w2v
model = KMeans(max_iter=300, n_clusters=2)
model.fit(bag_of_words)
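Once fitted, the model can be stored and reused for predictions, which is what you asked for. A minimal sketch using joblib (the filename kmeans_phrases.joblib is just an example):
import joblib
# persist the fitted model
joblib.dump(model, "kmeans_phrases.joblib")
# later: reload it and predict the cluster of a new phrase,
# embedding the phrase the same way as the training data
model = joblib.load("kmeans_phrases.joblib")
new_phrase = ["send", "contract", "information"]
new_vec = np.mean([w2v_model[w] for w in new_phrase if w in w2v_model], axis=0)
print(model.predict(new_vec.reshape(1, -1)))  # cluster index of the new phrase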
Of course, averaging (or any other aggregation) discards some information about the words, and this information might be meaningful for clustering. But without aggregation you cannot get comparable phrase embeddings, because different phrases may have different lengths.
If your clustering of average embeddings fails, I would recommend looking for pretrained sentence embeddings (e.g. Google's Universal Sentence Encoder, or embeddings from BERT).
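A rough sketch of how such sentence embeddings could replace the averaging step, assuming the tensorflow_hub package is installed (the module URL points to the public Universal Sentence Encoder v4 model):
import tensorflow_hub as hub
from sklearn.cluster import KMeans
# load the pretrained Universal Sentence Encoder (downloaded on first use)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
phrases = ["be information contract residential", "can send package recovery"]
embeddings = embed(phrases).numpy()  # shape (2, 512): one vector per phrase
model = KMeans(n_clusters=2).fit(embeddings)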