python · gensim · doc2vec

How to get Doc2Vec to run faster with a CPU count of 40?


I am building my own vocabulary to measure document similarity. I also attached the log of the run.

import time

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# tokenize each document and tag it with its ordinal number in the corpus
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

max_epochs = 1000
vec_size = 50
alpha = 0.025

tic = time.perf_counter()

# build a model from the tokenized data
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.0025,
                min_count=5,
                workers=5,
                dm=1)

model.build_vocab(tagged_data)

model.train(tagged_data, total_examples=model.corpus_count, epochs=max_epochs)

model.save("d2v.model")
print("Model Saved")

toc = time.perf_counter()
print(f"Time {toc - tic:0.4f} seconds")

Log of Doc2Vec


Solution

  • Generally, due to thread contention inherent in both the Python 'Global Interpreter Lock' ('GIL') and Gensim's default master-reader-thread, many-worker-thread approach, training can't keep all cores mostly busy with separate threads once you get past about 8-16 cores.

    If you can accept that the only tag for each text will be its ordinal number in the corpus, the alternate corpus_file method of specifying the training data allows arbitrarily many threads to each open their own reader into the (whitespace-token-delimited) plain-text corpus file, achieving much higher core utilization when you have 16+ cores/workers. (A minimal sketch of this approach appears at the end of this answer.)

    See the Gensim docs for the corpus_file parameter:

    https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec

    Note, though, that there are some unresolved bugs hinting this mode might mishandle or miss training data at some segmentation boundaries. (This may not be significant with large training data.)

    Otherwise, other parameter tweaks that help Word2Vec/Doc2Vec training run faster may be worth trying, such as altered window, vector_size, or negative values. (Note, though, a counterintuitive effect: when the bottleneck is thread contention in Gensim's default corpus-iterable mode, some values of these parameters that normally require more computation, and thus imply slower training, instead mainly soak up previously-idle contention time, and so are comparatively 'free'. So when suffering contention, trying more-expensive values for window/negative/vector_size may become more practical.)

    Generally, a higher min_count (discarding more rare words), or a more-aggressive (smaller) sample value (discarding more of the over-represented high-frequency words), can also reduce the amount of raw training happening and thus finish training faster, with minimal effect on quality. (Sometimes, more-aggressive sample values manage to both speed training and improve results on downstream evaluations, by letting the model spend relatively more time on rarer words that still matter downstream.) The second sketch at the end of this answer shows these parameters together with the window/negative values mentioned above.
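
    Below is a minimal sketch of the corpus_file approach, assuming the tagged_data list and vec_size value from the question are already in scope; the corpus.txt filename, the worker count, and the epochs value are placeholders, not recommendations. Each line of the file holds one whitespace-delimited document, and its line number becomes the document's tag.

    from gensim.models.doc2vec import Doc2Vec

    corpus_path = "corpus.txt"  # hypothetical path for the plain-text corpus

    # one document per line, tokens separated by spaces;
    # in corpus_file mode the line number serves as the document tag
    with open(corpus_path, "w", encoding="utf-8") as fout:
        for doc in tagged_data:
            fout.write(" ".join(doc.words) + "\n")

    model = Doc2Vec(vector_size=vec_size,
                    min_count=5,
                    dm=1,
                    workers=32)  # many more workers can stay busy in this mode

    model.build_vocab(corpus_file=corpus_path)

    model.train(corpus_file=corpus_path,
                total_examples=model.corpus_count,
                total_words=model.corpus_total_words,
                epochs=20)  # epochs must be passed explicitly to train()

    model.save("d2v.model")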
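
    And a second, hedged illustration of the parameter knobs discussed above; the specific values are only meant to show the direction of each change relative to Gensim's defaults (min_count=5, sample=1e-3), not tuned recommendations:

    model = Doc2Vec(vector_size=50,
                    window=5,
                    negative=5,
                    dm=1,
                    min_count=10,   # higher than the default 5: discard more rare words
                    sample=1e-5,    # smaller (more aggressive) than the default 1e-3
                    workers=8)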