I am trying to do a clustering analysis (preferably k-means) of poetry words stored in a pandas DataFrame. First I am trying to vectorize the words using the Word2Vec model from the gensim package. However, the vectors all come out as 0s, so my code is failing to translate the words into vectors, and as a result the clustering doesn't work. Here is my code:
# create a gensim model
model = gensim.models.Word2Vec(vector_size=100)
# copy original pandas dataframe with poems
data = poems.copy(deep=True)
# get data ready for kmeans clustering
final_data = []  # empty list
for i, row in data.iterrows():
    poem_vectorized = []
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words:  # iterate through list of words
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)
X = np.asarray(final_data)
print(X)
On closer inspection, the failure seems to happen at:
poem_vectorized.append(list(model.wv[poem_w]))
If I understand it correctly, you want to use an existing model to get the semantic embeddings of the tokens and then cluster the words, right?
The way you set the model up, you are preparing a new model for training, but you never feed it any training data and never train it. So your model doesn't know any words and always raises a KeyError when you call model.wv[poem_w].
Use gensim.downloader to load an existing model (check out their repository for a list of all available models):
import gensim.downloader as api
import numpy as np
import pandas
poems = pandas.DataFrame({"Main_text": ["This is a sample poem.", "This is another sample poem."]})
model = api.load("glove-wiki-gigaword-100")
Then use it to retrieve the vectors for all words the model knows:
final_data = []
for poem in poems['Main_text']:
    poem_all_words = poem.split()
    poem_vectorized = []
    for poem_w in poem_all_words:
        if poem_w in model:
            poem_vectorized.append(model[poem_w])
    poem_vectorized_mean = np.mean(poem_vectorized, axis=0)
    final_data.append(poem_vectorized_mean)
Or as a list comprehension:
final_data = []
for poem in poems['Main_text']:
    poem_vectorized_mean = np.mean([model[poem_w] for poem_w in poem.split() if poem_w in model], axis=0)
    final_data.append(poem_vectorized_mean)
Both of which will give you:
X = np.asarray(final_data)
print(X)
> [[-3.74696642e-01 3.73661995e-01 4.09943342e-01 -2.07784668e-01
...
-1.85739681e-01 -7.07386672e-01 3.31366658e-01 3.31600010e-01]
[-3.29973340e-01 4.13213342e-01 5.26199996e-01 -2.29261339e-01
...
-1.25366330e-01 -5.87253332e-01 2.80240029e-01 2.56700337e-01]]
Note that calling np.mean() on an empty list yields NaN (with a RuntimeWarning) rather than a usable vector, so you might want to guard against poems which are empty or where all words are unknown to the model.
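One way to handle that edge case, and then feed the result into the k-means step the question was aiming for. This is a sketch: the zero-vector fallback and n_clusters=2 are illustrative choices, not part of the answer above:

```python
import numpy as np
from sklearn.cluster import KMeans

dim = 100
per_poem_vectors = [
    [np.ones(dim), np.full(dim, 2.0)],  # poem with two known words
    [],                                 # poem where no word was in the model
]

final_data = []
for vectors in per_poem_vectors:
    if vectors:  # guard: np.mean([]) would give NaN
        final_data.append(np.mean(vectors, axis=0))
    else:
        final_data.append(np.zeros(dim))  # illustrative fallback vector

X = np.asarray(final_data)
assert not np.isnan(X).any()  # safe to cluster now

# the clustering the question was after
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Whether a zero vector is a sensible stand-in depends on your analysis; dropping those poems from X (and remembering their indices) is the other common option.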