We have n documents. When a user submits a new document, our goal is to inform them about possible duplicates among the existing documents (much like Stack Overflow suggests that a question may already have an answer).
In our system, a new document is uploaded every minute, and most are about the same topic (where the chance of duplication is higher).
Our current implementation uses a gensim Doc2Vec model trained on the documents (each tagged with its unique document id). We infer a vector for the new document and find the most_similar documents (ids) to it. The reason for choosing a Doc2Vec model is that we wanted to take advantage of semantics to improve results. As far as we know, it does not support online training, so we might have to schedule a cron job or something similar that periodically updates the model. But scheduling a cron job will be disadvantageous because documents arrive in bursts: a user may upload a duplicate while the model has not yet been trained on the new data. Also, given the huge amount of data, training time will be high.
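Roughly, the current flow looks like this (a simplified sketch; the model path and token pre-processing are placeholders, and on gensim below 4.0 model.dv would be model.docvecs):

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("doc2vec.model")   # trained on the existing tagged documents

    def find_possible_duplicates(new_doc_tokens, topn=10):
        """Infer a vector for the newly submitted document and return the ids of the
        most similar existing documents with their cosine similarities."""
        inferred = model.infer_vector(new_doc_tokens)
        return model.dv.most_similar([inferred], topn=topn)   # model.docvecs on gensim < 4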
So I would like to know how such cases are handled at big companies. Is there a better alternative, or a better algorithm, for this problem?
You don't have to take the old model down to start training a new model, so despite any training lags, or new-document bursts, you'll always have a live model doing the best it can.
Depending on how much the document space changes over time, you might find retraining to have a negligible benefit. (One good model, built on a large historical record, might remain fine for inferring new vectors indefinitely.)
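For example, one rough pattern (assuming a single Python process serves the similarity queries; all names and parameters below are illustrative, not a prescribed setup) is to retrain in the background and only swap the in-memory reference once the new model is ready:

    import threading
    from gensim.models.doc2vec import Doc2Vec

    live_model = Doc2Vec.load("doc2vec.model")   # keeps serving queries meanwhile
    _swap_lock = threading.Lock()

    def retrain_and_swap(tagged_corpus):
        """Train a fresh model on the full corpus in the background, then swap it in."""
        global live_model
        new_model = Doc2Vec(vector_size=300, epochs=20)  # illustrative parameters
        new_model.build_vocab(tagged_corpus)
        new_model.train(tagged_corpus,
                        total_examples=new_model.corpus_count,
                        epochs=new_model.epochs)
        with _swap_lock:
            live_model = new_model               # later queries use the retrained model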
Note that tuning inference to use more steps (especially for short documents), or a lower starting alpha (more like the training default of 0.025), may give better results.
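For example (a rough sketch; the passes parameter of infer_vector() is named epochs in current gensim and steps in older releases, and the exact defaults vary by version):

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("doc2vec.model")               # placeholder path
    new_doc_tokens = ["example", "document", "tokens"]  # pre-tokenized new document

    # More inference passes and a lower starting learning rate than the inference
    # defaults often help, especially on short documents; values are illustrative.
    inferred = model.infer_vector(new_doc_tokens, epochs=50, alpha=0.025)
    candidates = model.dv.most_similar([inferred], topn=10)  # model.docvecs on gensim < 4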
If word-vectors are available, there is also the "Word Mover's Distance" (WMD) calculation of document similarity, which might be even better at identifying close duplicates. Note, though, it can be quite expensive to calculate – you might want to run it only against a subset of likely candidates, or add many parallel processors, to do it in bulk. There's another, newer distance metric called 'soft cosine similarity' (available in recent gensim) that's somewhere between simple vector-to-vector cosine-similarity and full WMD in its complexity, and may be worth trying.
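A rough sketch of a WMD re-ranking pass over a small, pre-filtered candidate set (it assumes compatible word-vectors saved as a KeyedVectors file; wmdistance() also needs an optimal-transport helper package such as pyemd or POT, depending on gensim version):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load("word_vectors.kv")    # placeholder: compatible word-vectors

    def rank_by_wmd(new_doc_tokens, candidate_docs):
        """candidate_docs: {doc_id: token_list} for a small, pre-filtered candidate set.
        Lower WMD means closer documents; near-zero suggests a near-duplicate."""
        scored = [(doc_id, wv.wmdistance(new_doc_tokens, tokens))
                  for doc_id, tokens in candidate_docs.items()]
        return sorted(scored, key=lambda pair: pair[1])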
To the extent the vocabulary hasn't expanded, you can load an old Doc2Vec model and continue to train() it – and starting from an already-working model may help you achieve similar results with fewer passes. But note: it currently doesn't support learning any new words, and the safest practice is to re-train with a mix of all known examples interleaved. (If you only train on incremental new examples, the model may lose a balanced understanding of the older documents that aren't re-presented.)
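A minimal sketch of that continued-training path, under the assumption that the words and tags of every example are already in the saved model's vocabulary (names and paths below are placeholders):

    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load("doc2vec.model")        # previously trained model

    # Placeholder: the full set of TaggedDocument examples, interleaved. This sketch
    # assumes their words and tags are already known to the model; genuinely new
    # words would require a vocabulary update or a retrain from scratch.
    mixed_corpus = load_all_tagged_documents()

    model.train(mixed_corpus,
                total_examples=len(mixed_corpus),
                epochs=5)                        # illustrative: fewer passes than a cold start
    model.save("doc2vec.model")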
(If your chief concern is documents that duplicate exact runs-of-words, rather than just similar fuzzy topics, you might look at mixing in other techniques, such as breaking a document into a bag-of-character-ngrams, or 'shingleprinting', as is common in plagiarism-detection applications.)
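For instance, a toy bag-of-character-ngrams check with Jaccard overlap (the n-gram size and threshold are arbitrary illustrations, not tuned values):

    def char_ngrams(text, n=5):
        """Bag of overlapping character n-grams ('shingles') for a document."""
        text = " ".join(text.lower().split())    # normalize case and whitespace
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def looks_like_exact_duplicate(new_text, old_text, threshold=0.8):
        """High shingle overlap points to copied runs of words, not just similar topics."""
        return jaccard(char_ngrams(new_text), char_ngrams(old_text)) >= threshold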