Tags: python, python-3.x, gensim, word2vec

Gensim's word2vec returning awkward vectors


Given heavily cleaned input in the format

model_input = [['TWO people admitted fraudulently using bank cards (...)'],
               ['All tyrants believe forever',
                'But history especially People Power (...) first Bulatlat']]

word2vec is returning, alongside the more obvious results, super-specific vectors such as

{'A pilot shot dogfight Pakistani aircraft returned India Friday freed Islamabad called peace gesture following biggest standoff two countries years':
     <gensim.models.keyedvectors.Vocab at 0x12a93572828>,
 'This story published content partnership POLITICO':
     <gensim.models.keyedvectors.Vocab at 0x12a93572a58>,
 'Facebook says none 200 people watched live video New Zealand mosque shooting flagged moderators underlining challenge tech companies face policing violent disturbing content real time': 
    <gensim.models.keyedvectors.Vocab at 0x12a93572ba8>}

This appears to be happening to more documents than not, and I have a hard time believing each of those sentences appears more than five times.

I'm using the following code to create my model:

import gensim

TRAIN_EPOCHS = 30   # number of training passes over the corpus
WINDOW = 5          # context window size
MIN_COUNT = 5       # ignore tokens with a total frequency below this
DIMS = 250          # dimensionality of the trained vectors

vocab_model = gensim.models.Word2Vec(model_input,
                                     size=DIMS,
                                     window=WINDOW,
                                     iter=TRAIN_EPOCHS,
                                     min_count=MIN_COUNT)

What am I doing wrong that I'm getting such useless vectors?


Solution

  • Word2Vec expects its training corpus – its sentences argument – to be a re-iterable Python sequence where each item is itself a list-of-words.

    Your model_input is indeed a list of lists, but each item in those inner lists is a full multi-word sentence as a single string. As a result, where Word2Vec expects individual word-tokens (as strings), you're giving it full untokenized sentences (as strings), so each unique sentence becomes a single "word" in the model's vocabulary. That is exactly what the keys in your output show.

    If you break your texts into lists-of-words, and feed a sequence of those lists-of-words to the model as training data, you'll get vectors for word-tokens rather than sentence-strings, as in the sketch below.
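
    For example, here is a minimal sketch of that fix, assuming model_input has the nested structure shown in the question (the tokenized_corpus name is just illustrative). Since the text is already heavily cleaned, a plain whitespace split may be enough; gensim.utils.simple_preprocess is a common alternative that also lowercases and strips punctuation:

    import gensim

    # Flatten the nested input: every sentence string becomes one
    # list-of-words training item.
    tokenized_corpus = [sentence.split()
                        for document in model_input
                        for sentence in document]

    vocab_model = gensim.models.Word2Vec(tokenized_corpus,
                                         size=DIMS,
                                         window=WINDOW,
                                         iter=TRAIN_EPOCHS,
                                         min_count=MIN_COUNT)

    # The vocabulary now contains individual words rather than whole
    # sentences, e.g.:
    # vocab_model.wv.most_similar('people')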