
Doc2Vec: Do we need to train model with utils.shuffle?


I am new to NLP and Doc2Vec. I've noticed that some websites train Doc2Vec by shuffling the training data in each epoch (Option 1), while others use Option 2, which does no shuffling of the training data.

What is the difference? Also, how do I select the optimal alpha? Thank you.

### Option 1 ###

for epoch in range(30):
    model_dbow.train(
        utils.shuffle([x for x in tqdm(train_tagged.values)]),
        total_examples=len(train_tagged.values),
        epochs=1,
    )
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

vs

### Option 2 ###
model_dbow.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=30)


Solution

  • If your corpus might have some major difference-in-character between early & late documents – such as certain words/topics that are all front-loaded to early docs, or all back-loaded in later docs – then performing one shuffle up-front to eliminate any such pattern may help a little. It's not strictly necessary & its effects on end results will likely be small.

    Re-shuffling between every training pass is not common & I wouldn't expect it to offer a detectable benefit justifying its cost/code-complexity.

    Regarding your "Option 1" vs "Option 2": Don't call train() multiple times in your own loop unless you're an expert who knows exactly why you're doing that. (And: any online example suggesting that is often a poor/buggy one.)
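    On the alpha question: in a single train(..., epochs=30) call, gensim decays the effective learning rate linearly from alpha down to min_alpha over the whole run, so you normally don't need to manage it yourself. The sketch below is plain Python (not gensim internals) that approximates that per-epoch decay under that assumption, and contrasts it with Option 1's manual `alpha -= 0.002`, which silently drives the learning rate negative partway through 30 epochs.

    ```python
    # Sketch only: approximates the linear alpha decay a single gensim
    # train() call performs; it is NOT actual gensim code.

    def alpha_schedule(alpha, min_alpha, epochs):
        """Approximate per-epoch starting learning rates for one train() call."""
        step = (alpha - min_alpha) / epochs
        return [alpha - step * e for e in range(epochs)]

    # Option 2 style: one train() call, gensim-like smooth decay.
    # Defaults shown (alpha=0.025, min_alpha=0.0001) match gensim's usual values.
    rates = alpha_schedule(alpha=0.025, min_alpha=0.0001, epochs=30)
    print(rates[0], rates[-1])  # starts at 0.025, ends small but still positive

    # Option 1 style: manual fixed decrement of 0.002 per epoch.
    # After 13 decrements, 0.025 - 13 * 0.002 = -0.001 -- a negative
    # learning rate, which corrupts training with no error raised.
    manual = [0.025 - 0.002 * e for e in range(30)]
    print(manual[-1])  # negative by the final epoch
    ```

    This is why the manual-loop pattern is considered buggy: the decrement is disconnected from the epoch count, so changing `range(30)` without recomputing `0.002` breaks the schedule. Leaving alpha/min_alpha at their defaults and passing epochs once is the safer choice.
    
    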