Tags: python, gensim, doc2vec, recommendation-engine

How to get similarity score for unseen documents using Gensim Doc2Vec model?


I have trained a gensim Doc2Vec model for an English news recommender system. The model was trained on 40K news articles. I'm using the code below to recommend the top 5 most similar articles for, e.g., news_1:

# news_1 must be a list of preprocessed tokens, matching the training data
inferred_vector = model.infer_vector(news_1)
sims = model.dv.most_similar([inferred_vector], topn=5)

The problem is that if I add another 100 news articles to the database (so the database now holds 40K + 100 articles) and re-run the same code, it can still only recommend articles from the original 40K (instead of 40K + 100). In other words, the recommendations will never include any of the 100 new articles.

How can I address this issue without retraining the model? Thank you in advance!

PS: Since our app is a news app, lots of new articles flow into our database every day, so retraining the model daily isn't an option (doing so could overload our backend server).


Solution

  • Training initially creates a bulk contiguous vector structure for the initial known set of vectors. That structure is amenable to the every-candidate bulk vector calculation at the heart of most_similar() - so that operation goes about as fast as it can, given the right vector libraries for your OS/processor.

    But, that structure wasn't originally designed with incremental expansion in mind. Indeed, if you have 1 million vectors in a dense array and then want to add 1 to the end, the straightforward approach requires you to allocate a new 1-million-and-1-long array, bulk-copy over the 1 million, then add the last 1. That works, but what seems like a "tiny" operation takes a while, and ever longer as the structure grows. And each add more-than-doubles the temporary memory usage, for the bulk copy. So, the naive pattern of adding a whole bunch of new items individually in a loop can be really slow & memory-intensive (see the toy sketch below).
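    As a toy illustration of that cost - plain numpy, not Gensim's internals, but the same dense-array growth pattern:

        import numpy as np

        vecs = np.random.rand(1_000_000, 100).astype(np.float32)  # existing dense array of vectors

        # "Appending" one row means allocating a new (n+1)-row array and
        # bulk-copying all n existing rows - O(n) time, ~2x peak memory.
        one_new = np.random.rand(1, 100).astype(np.float32)
        vecs = np.vstack([vecs, one_new])   # copies all 1,000,000 old rows

        # Repeating that per-item costs O(n*k) copying for k additions;
        # a single batched add pays the copy cost only once:
        batch = np.random.rand(100, 100).astype(np.float32)
        vecs = np.vstack([vecs, batch])     # one copy for the whole batch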

    So, Gensim hasn't yet focused on providing a set-of-vectors that's easy & efficient to incrementally grow with new vectors. But, it's still indirectly possible, if you understand the caveats.

    Especially in gensim-4.0.0 & above, the .dv set of doc-vectors is an instance of KeyedVectors, with all that class's standard functions. Those include the add_vector() and add_vectors() methods:

    https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vector

    https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.add_vectors

    You can try these methods to add your new inferred vectors to the model.dv object - and then they'll also be included in follow-up most_similar() results.
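    For instance, a minimal sketch of that approach - the tags and token lists here are hypothetical, and the tokens are assumed to be preprocessed exactly as in training:

        import numpy as np

        # Hypothetical new articles: tag -> tokens, preprocessed as in training
        new_articles = {
            'news_40001': ['stocks', 'rallied', 'after', 'the', 'fed', 'announcement'],
            'news_40002': ['city', 'council', 'approves', 'new', 'transit', 'plan'],
        }

        # Infer one vector per new article, then add them in a single batch
        # (one reallocation for the whole batch, per caveat 1 below)
        tags = list(new_articles)
        vectors = np.vstack([model.infer_vector(new_articles[t]) for t in tags])
        model.dv.add_vectors(tags, vectors)

        # The new articles now participate in similarity queries
        sims = model.dv.most_similar('news_40001', topn=5)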

    But keep in mind:

    1. The above caveats about performance & memory-usage - which may be minor concerns as long as your dataset isn't too large, or manageable if you do additions in occasional larger batches.

    2. The containing Doc2Vec model generally isn't expecting its internal .dv to be arbitrarily modified or expanded by other code. So, once you start doing that, parts of the model may not behave as expected. If you have problems with this, you could consider saving-aside the full Doc2Vec model before any direct-tampering with its .dv, and/or only expanding a completely separate instance of the doc-vectors, for example by saving them aside (eg: model.dv.save(DOC_VECS_FILENAME)) & reloading them into a separate KeyedVectors (eg: growing_docvecs = KeyedVectors.load(DOC_VECS_FILENAME)).