Tags: python, nlp, cluster-analysis, gensim

How do I use gensim to vectorize these words in my dataframe so I can perform clustering on them?


I am trying to do a clustering analysis (preferably k-means) of poetry words in a pandas dataframe. As a first step I am trying to vectorize the words using the Word2Vec model in the gensim package. However, the vectors all come out as zeros, so my code is failing to translate the words into vectors, and as a result the clustering doesn't work. Here is my code:

import gensim
import numpy as np

# create a gensim model
model = gensim.models.Word2Vec(vector_size=100)
# copy original pandas dataframe with poems
data = poems.copy(deep=True)
# get data ready for kmeans clustering
final_data = [] # empty list 
for i, row in data.iterrows(): 
    poem_vectorized = [] 
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words: #iterate through list of words 
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)
X = np.asarray(final_data)
print(X)

Output: an array full of zeros.

At closer inspection of:

poem_vectorized.append(list(model.wv[poem_w]))

the problem seems to be that every lookup raises an error, which the bare except silently swallows, so nothing is ever appended to poem_vectorized.


Solution

  • If I understand correctly, you want to use an existing model to get semantic embeddings for the tokens and then cluster the words, right?

    The way you set the model up, you are preparing a new model for training, but you never feed it any training data or train it. Your model therefore doesn't know any words and always throws a KeyError when you call model.wv[poem_w].
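    A minimal sketch of that failure mode (assuming gensim 4.x, where Word2Vec takes vector_size):

    ```python
    from gensim.models import Word2Vec

    # A freshly constructed Word2Vec model has an empty vocabulary:
    # nothing is known until build_vocab()/train() have been run.
    model = Word2Vec(vector_size=100)
    print(len(model.wv))  # 0 -- no words in the vocabulary

    try:
        model.wv["poem"]
    except KeyError as err:
        # every single lookup ends up here, which is why the
        # question's try/except leaves poem_vectorized empty
        print(err)
    ```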

    Use gensim.downloader to load an existing model (check out their repository for a list of all available models):

    import gensim.downloader as api
    import numpy as np
    import pandas
    
    poems = pandas.DataFrame({"Main_text": ["This is a sample poem.", "This is another sample poem."]})
    model = api.load("glove-wiki-gigaword-100")
    

    Then use it to retrieve the vectors for all words the model knows:

    final_data = []
    for poem in poems['Main_text']:
        poem_all_words = poem.split()
        poem_vectorized = []
        for poem_w in poem_all_words:
            if poem_w in model:
                poem_vectorized.append(model[poem_w])
        poem_vectorized_mean = np.mean(poem_vectorized, axis=0)
        final_data.append(poem_vectorized_mean)
    

    Or as list comprehension:

    final_data = []
    for poem in poems['Main_text']:
        poem_vectorized_mean = np.mean([model[poem_w] for poem_w in poem.split() if poem_w in model], axis=0)
        final_data.append(poem_vectorized_mean)
    

    Which both will give you:

    X = np.asarray(final_data)
    print(X)
    > [[-3.74696642e-01  3.73661995e-01  4.09943342e-01 -2.07784668e-01
        ...
        -1.85739681e-01 -7.07386672e-01  3.31366658e-01  3.31600010e-01]
       [-3.29973340e-01  4.13213342e-01  5.26199996e-01 -2.29261339e-01
        ...
        -1.25366330e-01 -5.87253332e-01  2.80240029e-01  2.56700337e-01]]
    

    Note that calling np.mean() on an empty list yields NaN (with a RuntimeWarning) rather than a vector, so you might want to guard against poems which are empty or where all words are unknown to the model.
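    One way to add that guard and then run the k-means step you were after (a sketch using scikit-learn; the fallback zero vector and the cluster count of 2 are arbitrary choices here):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans  # scikit-learn

    def mean_vector(word_vectors, dim=100):
        """Average one poem's word vectors; fall back to zeros if none were known."""
        if len(word_vectors) == 0:
            return np.zeros(dim)
        return np.mean(word_vectors, axis=0)

    # Toy stand-in for the per-poem vector lists: one poem with known
    # words, one poem where every word was unknown to the model.
    per_poem = [[np.ones(100), 3 * np.ones(100)], []]
    X = np.stack([mean_vector(v) for v in per_poem])

    # k-means on the poem-level mean vectors, as in the original goal
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print(labels)  # one cluster label per poem
    ```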