Tags: callback, gensim, multicore, word-embedding, doc2vec

How to check via callbacks if alpha is decreasing? + How to load all cores during training?


I'm training Doc2Vec, and I'm using callbacks to try to see whether alpha is decreasing over training time, using this code:

import os
import multiprocessing

from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec
from gensim.test.utils import get_tmpfile


class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''

    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0

        os.makedirs(self.path_prefix, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = get_tmpfile(
            '{}_epoch{}.model'.format(self.path_prefix, self.epoch)
        )
        model.save(savepath)
        print(
            "Model alpha: {}".format(model.alpha),
            "Model min_alpha: {}".format(model.min_alpha),
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch"
        )
        self.epoch += 1


def train():
    workers = multiprocessing.cpu_count() * 4
    model = Doc2Vec(
        DocIter(),  # my corpus iterable of TaggedDocument objects
        vector_size=600, alpha=0.03, min_alpha=0.00025, epochs=20,
        min_count=10, dm=1, hs=1, negative=0, workers=workers,
        callbacks=[EpochSaver("./checkpoints")]
    )
    print(
        "HS:", model.hs, "Negative:", model.negative, "Epochs:",
        model.epochs, "Workers:", model.workers,
        "Model alpha: {}".format(model.alpha)
    )

While training I see that alpha is not changing over time; on each callback it reports alpha = 0.03.
Is it possible to check whether alpha is decreasing? Or is it really not decreasing at all during training?

One more question: how can I make use of all my cores while training Doc2Vec?

[Screenshot: per-core CPU load during training]

As you can see, no core is loaded to more than roughly 30%.


Solution

  • The model.alpha property only holds the initially-configured starting alpha – it is not updated to reflect the effective learning rate as training progresses.

    So, even if the value is being decayed properly (and I expect that it is), you wouldn't see that in the logging you've added. (A callback sketch that reports the approximate per-epoch rate appears after the notes below.)

    Separate observations about your code:

    • in gensim versions at least through 3.5.0, maximum training throughput is most often reached with a workers value somewhere between 3 and the number of cores – but usually not the full core count (if that's higher than 12) or anything larger. So workers=multiprocessing.cpu_count()*4 is likely to be much slower than what you could achieve with a lower number.

    • if your corpus is large enough to support 600-dimensional vectors, and discarding words with fewer than min_count=10 occurrences, negative sampling may work faster and get better results than the hs mode. (The pattern in published work seems to be to prefer negative sampling with larger corpora.) A constructor sketch combining this and the workers change above closes out the answer below.
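
To actually watch the decay, a callback can reconstruct the rate that the linear schedule implies at each epoch from the configured alpha, min_alpha, and epochs. The sketch below is illustrative: the AlphaLogger class and its interpolation are my own approximation of the linear start-to-min decay, not a value the model exposes directly.

from gensim.models.callbacks import CallbackAny2Vec

class AlphaLogger(CallbackAny2Vec):
    '''Print the approximate effective learning rate at the start of each epoch.'''

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        # gensim decays the rate linearly from alpha down to min_alpha over the
        # whole run; reconstruct roughly where that decay stands at this epoch.
        progress = self.epoch / model.epochs
        effective_alpha = model.alpha - (model.alpha - model.min_alpha) * progress
        print("Epoch {}: effective alpha ~{:.5f} (configured alpha stays {})".format(
            self.epoch + 1, effective_alpha, model.alpha))
        self.epoch += 1

Passing callbacks=[AlphaLogger(), EpochSaver("./checkpoints")] into the constructor would then print a number that shrinks toward min_alpha each epoch, even while model.alpha itself keeps reporting 0.03.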
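
Putting the two observations together, here's a rough sketch of how the constructor call might change; the workers cap of 8 and negative=5 are illustrative starting points rather than tuned values, and DocIter / EpochSaver are the classes from your question.

import multiprocessing
from gensim.models.doc2vec import Doc2Vec

# Illustrative values only: cap workers instead of using cpu_count()*4,
# and switch from hierarchical softmax to negative sampling.
workers = min(multiprocessing.cpu_count(), 8)

model = Doc2Vec(
    DocIter(),
    vector_size=600, alpha=0.03, min_alpha=0.00025, epochs=20,
    min_count=10, dm=1,
    hs=0, negative=5,
    workers=workers,
    callbacks=[EpochSaver("./checkpoints")]
)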