I am using Gensim to train a Word2Vec embedding on different corpora, each pertaining to a different year, in order to compare the embedding vectors.
My question is: if I repeat the documents of a specific year twice and the documents of another year just once, will the resulting embeddings give more weight to the repeated documents?
My idea is to build a corpus that gives more weight to recent documents and less weight to documents from the distant past.
I simply train the model on my LineSentence corpus file:
from gensim.models import Word2Vec
model = Word2Vec(corpus_file=corpus, vector_size=100, window=5, min_count=5, workers=4)
Sure, repeating some texts (even more than the re-iterations controlled by the epochs count) means they'll have more influence on the final model.
In general, repeating identical texts isn't as good as truly varied alternative examples of the same words. For example, if you only have one text using a certain word, repeating it 5 times might make the word survive the min_count=5 cutoff, but the lack of many subtly-contrasting appearances means its final vector will only reflect that one peculiar, repeated use. The kind of good relative word-vector positions that people are usually seeking need a training tug-of-war between all the ways a word is used.
But in your case you should still have many varied examples; you're just overtraining on some of them.
Do note that it might be a little better to ensure the repeats are shuffled throughout the whole corpus, at least once before training begins, rather than clustered all together. (Repeating a text 10 times in a row will overtrain those words/contexts – but not in as balanced a way as if they were interleaved with all the other differently-weighted training.)
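A minimal sketch of one way to build such an interleaved corpus, assuming a hypothetical docs_by_year dict mapping each year to its tokenized documents, and a hypothetical weight_for() helper returning an integer repeat count per year:

import random

# hypothetical: docs_by_year maps each year to its list of tokenized docs,
# and weight_for(year) returns an integer repeat count (e.g., 2 for recent years)
weighted_corpus = []
for year, docs in docs_by_year.items():
    weighted_corpus.extend(docs * weight_for(year))

# shuffle once so the repeats are interleaved with all the other training
# data, rather than clustered together
random.shuffle(weighted_corpus)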
And note that you might not want all the 1-occurrence words in a subset you repeat 5 times to automatically survive the min_count cutoff, because each still has just that one true, weak context example. So you might want to learn the vocabulary from a non-reweighted corpus, but then train on the new corpus with the artificial repeats (being sure to provide your .train() call with the right new total_examples count, so it reports progress & adjusts the learning-rate properly).
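A rough sketch of that two-step pattern, assuming hypothetical file names corpus_original.txt (every document exactly once) and corpus_reweighted.txt (recent documents repeated):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

original = LineSentence('corpus_original.txt')      # each document exactly once
reweighted = LineSentence('corpus_reweighted.txt')  # recent documents repeated

model = Word2Vec(vector_size=100, window=5, min_count=5, workers=4)

# learn the vocabulary (and thus the min_count survivors) from the
# un-repeated corpus
model.build_vocab(original)

# count the sentences in the reweighted corpus so train() can report
# progress and decay the learning rate correctly
n_examples = sum(1 for _ in reweighted)

model.train(reweighted, total_examples=n_examples, epochs=model.epochs)

Counting the examples up front costs one extra pass over the file, but it gives train() the exact total_examples it needs for its learning-rate schedule.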