
Inconsistent results when training gensim model with gensim.downloader vs manual loading


I am trying to understand what is going wrong in the following example.

To train on the 'text8' dataset as described in the docs, one only has to do the following:

import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load('text8')
model = Word2Vec(dataset)

Doing this gives very good embedding vectors, as verified by evaluating on a word-similarity task.

However, when manually loading the same text file used above, as in:

import os

text_path = os.path.expanduser('~/gensim-data/text8/text')  # open() won't expand '~' on its own
text = []
with open(text_path) as file:
    for line in file:
        text.extend(line.split())
text = [text]  # a single 'sentence' holding every token in the file

model = Word2Vec(text)

The model still reports training for the same number of epochs as above (5), but training finishes much faster, and the resulting vectors perform very badly on the similarity task.

What is happening here? I suppose it could have to do with the number of 'sentences', but the text8 file seems to have only a single line. Does gensim.downloader split the text8 file into sentences, and if so, into sentences of what length?


Solution

  • In your second example, you've created a training dataset with just a single text containing the entire contents of the file. That's roughly 17 million word tokens, in a single list.

    Word2Vec (& other related algorithms) in gensim have an internal implementation limitation, in their optimized paths, of 10,000 tokens per text item. All additional tokens are ignored.
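
    That 10,000-token ceiling corresponds to the MAX_WORDS_IN_BATCH constant in gensim.models.word2vec; a quick sketch to check the value in your installed version:

    from gensim.models import word2vec

    # The per-text token cap used by the optimized training paths;
    # prints 10000 in current gensim releases
    print(word2vec.MAX_WORDS_IN_BATCH)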

    So, in your 2nd case, more than 99.9% of your data is being discarded. Training may seem instant, but very little actual training will have occurred. (Word-vectors for words that only appear after the first 10,000 tokens won't have been trained at all, retaining only their initial randomly-set values.) If you enable logging at the INFO level, you'll see more details about each step of the process, and discrepancies like this are easier to spot.
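
    For example, a minimal way to turn that logging on before training (the format string here is just one common choice):

    import logging

    # Route gensim's progress and corpus-statistics messages to the console
    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s',
        level=logging.INFO,
    )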

    Yes, the api.load() variant takes extra steps to break the single-line file into 10,000-token chunks. I believe it uses the LineSentence utility class for this purpose, whose source can be examined here:

    https://github.com/RaRe-Technologies/gensim/blob/e859c11f6f57bf3c883a718a9ab7067ac0c2d4cf/gensim/models/word2vec.py#L1209
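
    As a sketch of doing the same chunking yourself: LineSentence splits any line longer than its max_sentence_length parameter (10,000 tokens by default) into multiple texts, so pointing it at the raw text8 file should reproduce the behavior:

    import os
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # LineSentence yields lists of tokens, breaking the file's single
    # giant line into chunks of at most 10,000 tokens each
    corpus = LineSentence(os.path.expanduser('~/gensim-data/text8/text'))
    model = Word2Vec(corpus)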

    However, I recommend avoiding the api.load() functionality entirely. It doesn't just download data; it also downloads a shim of additional, outside-of-version-control Python code for preparing that data for extra operations. Such code is harder to browse and less well-reviewed than official gensim release code as packaged for PyPI/etc, and it presents an additional security risk. Each load target (referenced by a name like 'text8') may do something different, leaving you with a different object type as the return value.

    It's much better for understanding to directly download precisely the data files you need to known local paths, and to do the IO/prep yourself from those paths, so that you know exactly what steps have been applied and the only code you're running is officially versioned and released code.
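
    As a minimal sketch of that manual approach, assuming the text8 file is already saved at the path below, the chunking takes only a few lines:

    import os
    from gensim.models import Word2Vec

    CHUNK = 10000  # stay within gensim's per-text token limit

    with open(os.path.expanduser('~/gensim-data/text8/text')) as f:
        tokens = f.read().split()

    # Reshape the single giant token list into many <= 10,000-token texts
    texts = [tokens[i:i + CHUNK] for i in range(0, len(tokens), CHUNK)]

    model = Word2Vec(texts)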