Given heavily cleaned input in the following format:

    model_input = [['TWO people admitted fraudulently using bank cards (...)'],
                   ['All tyrants believe forever',
                    'But history especially People Power (...) first Bulatlat']]
Word2Vec is returning, alongside the more obvious results, hyper-specific entries keyed by entire sentences, such as

    {'A pilot shot dogfight Pakistani aircraft returned India Friday freed Islamabad called peace gesture following biggest standoff two countries years':
        <gensim.models.keyedvectors.Vocab at 0x12a93572828>,
     'This story published content partnership POLITICO':
        <gensim.models.keyedvectors.Vocab at 0x12a93572a58>,
     'Facebook says none 200 people watched live video New Zealand mosque shooting flagged moderators underlining challenge tech companies face policing violent disturbing content real time':
        <gensim.models.keyedvectors.Vocab at 0x12a93572ba8>}
This is happening for more documents than not, and I find it hard to believe that each of these full sentences appears more than five times (my min_count) in the corpus.
I'm using the following code to create my model:

    import gensim

    TRAIN_EPOCHS = 30
    WINDOW = 5
    MIN_COUNT = 5
    DIMS = 250

    vocab_model = gensim.models.Word2Vec(model_input,
                                         size=DIMS,
                                         window=WINDOW,
                                         iter=TRAIN_EPOCHS,
                                         min_count=MIN_COUNT)
What am I doing wrong that I'm getting such useless vectors?
Word2Vec expects its training corpus (its sentences argument) to be a re-iterable Python sequence in which each item is itself a list of word-tokens.

Your model_input is a list of lists, but each item inside those inner lists is a full multi-word sentence as a single string. So where the model expects individual word-tokens (as strings), you're giving it entire untokenized sentences (as strings), and it treats each unique sentence-string as one giant "word". That is exactly why whole sentences are showing up as vocabulary keys.
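For illustration, a correctly shaped corpus looks like this (the tokens here are hypothetical, not your actual data):

    correct_input = [
        ['two', 'people', 'admitted', 'fraudulently', 'using', 'bank', 'cards'],
        ['all', 'tyrants', 'believe', 'forever'],
    ]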
If you break your texts into lists of words and feed a sequence of those lists to the model as training data, you'll get vectors for word-tokens rather than sentence-strings, as in the sketch below.
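A minimal sketch, assuming your text is already clean enough that a plain whitespace split is an adequate tokenizer (gensim.utils.simple_preprocess is a common alternative that also lowercases and strips punctuation), and reusing your parameter constants:

    import gensim

    # Flatten the nested document lists and split each sentence-string
    # into a list of word-tokens. str.split() is a minimal whitespace
    # tokenizer; swap in gensim.utils.simple_preprocess if you need
    # lowercasing and punctuation stripping.
    tokenized_input = [sentence.split()
                       for document in model_input
                       for sentence in document]

    vocab_model = gensim.models.Word2Vec(tokenized_input,
                                         size=DIMS,
                                         window=WINDOW,
                                         iter=TRAIN_EPOCHS,
                                         min_count=MIN_COUNT)

    # The vocabulary keys are now individual words, not whole sentences.
    print(list(vocab_model.wv.vocab)[:10])

After retraining this way, individual words that meet your min_count threshold will each have a 250-dimensional vector, and the sentence-strings will disappear from the vocabulary.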