I am building my own vocabulary to measure document similarity. I also attached the log of the run.
import time

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Tokenize each document and tag it with its ordinal position in the corpus
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

max_epochs = 1000
vec_size = 50
alpha = 0.025

tic = time.perf_counter()

# Build a Doc2Vec model from the tokenized data
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.0025,
                min_count=5,
                workers=5,
                dm=1)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)
model.save("d2v.model")
print("Model Saved")

toc = time.perf_counter()
print(f"Time {toc - tic:0.4f} seconds")
Generally, due to thread contention inherent to both the Python 'Global Interpreter Lock' ('GIL') and Gensim's default one-master-reader-thread, many-worker-thread approach, training can't keep all cores mostly busy with separate threads once you get past about 8-16 cores.
If you can accept that the only tag for each text will be its ordinal number in the corpus, the alternate corpus_file method of specifying the training data allows arbitrarily many threads to each open their own readers into the (whitespace-token-delimited) plain-text corpus file, achieving much higher core utilization when you have 16+ cores/workers. See the Gensim docs for the corpus_file parameter:

https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec

Note, though, that there are some unresolved bugs which hint this mode might mishandle or miss training data at some segmentation boundaries. (This may not be significant with large training data.)
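A minimal sketch of that approach, assuming Gensim 4.x and re-using your data/word_tokenize variables (the corpus filename, worker count, and epochs value here are illustrative placeholders, not recommendations):

from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

# Write the tokenized texts to a whitespace-delimited plain-text file,
# one document per line; each document's tag is then implicitly its line number.
with open("corpus.txt", "w", encoding="utf-8") as fout:
    for _d in data:
        fout.write(" ".join(word_tokenize(_d.lower())) + "\n")

# Supplying corpus_file at construction builds the vocabulary & trains in one step,
# with every worker thread reading its own segment of the file.
model = Doc2Vec(corpus_file="corpus.txt",
                vector_size=50,
                min_count=5,
                workers=16,   # corpus_file mode can keep this many cores busy
                epochs=40,
                dm=1)

# Per-document vectors are then looked up by ordinal position in the file
print(model.dv[0])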
Otherwise, other tweaks to parameters that help Word2Vec/Doc2Vec training run faster may be worth trying, such as altered window, vector_size, or negative values. (Though note that, counterintuitively, when the bottleneck is thread contention in Gensim's default corpus-iterable mode, some values of these parameters that normally require more computation, and thus imply slower training, manage to mainly soak up previously-idle contention time, and thus are comparatively 'free'. So when suffering contention, trying more-expensive values for window/negative/vector_size may become more practical.)
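As a purely illustrative sketch of such 'more expensive, but nearly free under contention' settings (the specific values are assumptions, not recommendations):

model = Doc2Vec(vector_size=100,   # larger vectors than the original 50
                window=10,         # wider context window (default is 5)
                negative=10,       # more negative samples per prediction (default is 5)
                min_count=5,
                workers=5,
                dm=1)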
Generally, a higher min_count (discarding more rare words), or a more-aggressive (smaller) sample value (discarding more of the overrepresented high-frequency words), can also reduce the amount of raw training happening and thus finish training faster, with minimal effect on quality. (Sometimes, more-aggressive sample values manage to both speed training & improve results on downstream evaluations, by letting the model spend relatively more time on rarer words that are still important downstream.)
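For example, a rough sketch of such settings (the exact min_count and sample values are illustrative and should be tuned against your own corpus):

model = Doc2Vec(vector_size=50,
                min_count=20,      # discard words appearing fewer than 20 times
                sample=1e-05,      # more aggressive downsampling of frequent words (default is 1e-03)
                workers=5,
                dm=1)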