
Word2Vec vocab results in just letters and symbols


I'm new to Word2Vec and I am trying to cluster words based on their similarity. To start, I use nltk to split the text into sentences and then pass the resulting list of sentences as the input to Word2Vec. However, when I print the vocab, it is just a bunch of letters, numbers, and symbols rather than words. For example, one of the entries is `'L': <gensim.models.keyedvectors.Vocab object at 0x00000238145AB438>`.

# imports needed and logging
import gensim
from gensim.models import word2vec
import logging

import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
with open('C:\\Users\\Freddy\\Desktop\\Thesis\\Descriptions.txt','r') as f_open:
    text = f_open.read()
arr = []

sentences = nltk.sent_tokenize(text) # this gives a list of sentences

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

model = word2vec.Word2Vec(sentences, size = 300)

print(model.wv.vocab)

Solution

  • As the tutorial and the documentation for the Word2Vec class suggest, the constructor requires a list of lists of words as its first parameter (or, more generally, an iterable of iterables of words):

    sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora,...

    I believe that before feeding sentences into Word2Vec you need to apply word_tokenize to each sentence, changing the crucial line to:

    sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
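    To illustrate the shape change this line produces, here is a minimal sketch; a plain whitespace split stands in for nltk.word_tokenize (an assumption, just to keep the sketch dependency-free — real text should go through the NLTK tokenizer):

    ```python
    # Before the fix: a list of sentence strings (what the question passes in).
    sentences_raw = ["The cat sat.", "Dogs bark loudly."]

    # After the fix: a list of token lists -- the shape Word2Vec expects.
    # str.split() stands in for nltk.word_tokenize here.
    sentences_tokenized = [sent.split() for sent in sentences_raw]

    print(sentences_tokenized)
    # Each element is now a list of word tokens, not a bare string.
    ```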
    

    TL;DR

    You get letters as your "words" because Word2Vec treats each string (a sentence) as an iterable of words. Iterating over a string yields its individual characters, so those characters become the "words" the model learns from, instead of the intended word tokens.
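    You can see this directly in Python — iterating over a string yields single characters:

    ```python
    # Iterating over a string produces its characters, which is why a list of
    # sentence strings gives Word2Vec a vocabulary of letters and symbols.
    sentence = "The cat sat."
    tokens_word2vec_sees = list(sentence)
    print(tokens_word2vec_sees[:5])  # ['T', 'h', 'e', ' ', 'c']
    ```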

    As the old saying goes: garbage in, garbage out.