Search code examples
performanceproductiondata-scienceword2vecdimensionality-reduction

Bring Word2Vec models efficiently into Production Service


This is kind of a long shot, but I am hoping that someone has been in a similar situation as I am looking for some advice how to efficiently bring a set of large word2vec models into a production environment.

We have a range of trained w2v models with a dimensionality of 300. Due to the underlying data - huge corpus with POS tagged words; specialized vocabularies with up to 1 mio words - these models became quite large and we are currently looking into effective ways how to expose these to our users w/o paying a too high price in infrastructure.

Besides trying to better control the vocabulary size, obviously, dimensionality reduction on the feature vectors would be an option. Is anyone aware of publications around that, particularly on how this would affect model quality, and how to best measure this?

Another option is to pre-calculate the top X most similar words to each vocabulary word and to provide a lookup table. With the model size being that big, this is currently also very inefficient. Are there any heuristics known that could be used reduce the number of necessary distance calculations from n x n-1 to a lower number?

Thank you very much!


Solution

  • There are pre-indexing techniques for similarity-search in high-dimensional spaces which can speed nearest-neighbor discovery, but usually at a cost of absolute accuracy. (They also need more memory for the index.)

    An example is the ANNOY library. The gensim project includes a demo notebook showing its use with Word2Vec.

    I once did some experiments using just 16-bit (rather than 32-bit) floats in a Word2Vec model. It saved memory in the idle state, and nearest-neighbor top-N results were nearly unchanged. But, perhaps because some behind-the-scenes up-conversion to 32-bit floats was still occurring during the one-against-all distance-calculations, speed of operations was actually reduced. (And this suggests that each distance-calculation may have caused a temporary memory expansion offsetting any idle-state savings.) So it's not a quick fix, but further research here – perhaps involving finding/implementing the right routines for float16 array operations – could maybe mean 50% model-size savings and equivalent or even better speed.

    For many applications, discarding the least-frequent words doesn't hurt much – or even, when done before training, can improve the quality of the remaining vectors. As many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can discard the tail-end of the array to save memory, or limit most_similar() searches to the first-N entries to speed calculations.

    Once you've minimized the vocabulary size, you want to be sure the full set is in RAM, and no swapping is triggered during the (typical) full-sweep distance-calculations. If you need multiple processes to serve answers from the same vector set, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding gensim Word2Vec loading time.

    Finally, while precomputing top-N neighbors for a larger vocabulary is both time-consuming and memory-intensive, if your pattern of access is such that some tokens are checked far more than others, a cache of the N most-recently or M most-frequently requested top-N could improve perceived performance a lot – making only less-frequently-requested neighbor-lists require the full distance calculations to every other token.