Tags: python, stanford-nlp, word2vec, fasttext

Loading pretrained GloVe in production with Flask and Gunicorn


I have a model that requires some preprocessing using GloVe from Stanford. From my experience, it takes at least 20-30 seconds until the GloVe vectors are loaded by this code:

import pandas as pd
# read the space-delimited GloVe file, then build a word -> vector dict
glove_pd = pd.read_csv(embed_path+'/glove.6B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove = {key: val.values for key, val in glove_pd.T.items()}

My question is: what is the best practice for handling this in a production app? As far as I understand, every time I restart the server I need to wait 30 seconds until the endpoint is ready.

Also, I have read that when using Gunicorn, it is recommended to run with workers>1, something like this:

ExecStart=/path/to/gunicorn --workers 3 --bind unix:app.sock -m 007 wsgi:app

Does this mean that each Gunicorn worker has to load the same GloVe data into memory? If so, the server's resource usage will be quite large – let me know if I am correct here.

Bottom line, my question is: what are the recommended methods for hosting a model that requires a pretrained embedding (GloVe/word2vec/fastText) on a production server?


Solution

  • At one level, if you need it in memory, and that's how long it takes to read the gigabyte-plus of data from disk into useful RAM structures, then yes: that's how long it takes before a process is ready to use that data. But there's room for optimization!

    For example, reading this first as a Pandas DataFrame, then converting it to a Python dict, involves both more steps & more RAM than other options. (At the momentary peak, when both glove_pd and glove are fully constructed & referenced, you'll have two full copies in memory – and neither is as compact as would be ideal, which could trigger other slowdowns, especially if the bloat triggers any use of virtual memory.) A leaner direct parse is sketched just below.
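
    For instance, a minimal sketch that skips the DataFrame entirely and parses straight into a dict of compact float32 arrays (reusing your embed_path variable; this keeps just one copy in RAM, at half the width of pandas' default float64):

    import numpy as np

    glove = {}
    with open(embed_path + '/glove.6B.300d.txt', encoding='utf8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            # first token is the word; the rest are its 300 coordinates
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)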

    And as you fear, if 3 gunicorn workers each run the same loading code, 3 separate copies of the same data will be loaded – but there's a way to avoid this, below.

    I'd suggest first loading the vectors into a utility class for accessing word-vectors, like the KeyedVectors interface in the Gensim library. It'll store all the vectors in one compact numpy matrix, with a dict-like interface that still returns one numpy ndarray for each individual vector.

    For example, you can convert GloVe text-format vectors to a slightly-different interchange format with an extra header line (which Gensim calls word2vec_format, after its use by the original Google word2vec.c code). In gensim-3.8.3 (the current release as of August 2020) you can do:

    from gensim.scripts.glove2word2vec import glove2word2vec
    glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.w2vtxt')
    

    Then, the utility-class KeyedVectors can load them like so:

    from gensim.models import KeyedVectors
    glove_kv = KeyedVectors.load_word2vec_format('glove.6B.300d.w2vtxt', binary=False)
    

    (Starting in the future gensim-4.0.0 release, it should be possible to skip the conversion & just use the new no_header argument to read the GloVe text file directly: glove_kv = KeyedVectors.load_word2vec_format('glove.6B.300d.txt', binary=False, no_header=True). But this headerless format will be a little slower to load, as it requires two passes over the file – the first just to learn the full size.)

    Loading just once into KeyedVectors should already be faster & more-compact than your original generic two-step process. And, lookups that are analogous to what you were doing on the prior dict will be available on the glove_kv instance. (Also, there are many other convenience operations, like ranked .most_similar() lookup, that utilize efficient array library functions for speed.)
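
    For instance (with 'king' as an illustrative lookup key):

    # dict-style lookup returns a single numpy ndarray, like the prior dict did
    vec = glove_kv['king']                       # shape: (300,)

    # ranked similarity search, computed with fast bulk array operations
    print(glove_kv.most_similar('king', topn=5))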

    You can take another step, though, to minimize the parsing-on-load, to defer loading unneeded ranges of the full set of vectors, and to automatically reuse raw array data between processes.

    That extra step is to re-save the vectors using the Gensim instance's .save() function, which will dump the raw vectors into a separate dense file that's suitable for memory-mapping upon the next load. So first:

    glove_kv.save('glove.6B.300d.gs')
    

    This will create more than one file which must be kept together if relocated (typically a small metadata file plus one or more large raw-array files) – but the .npy file(s) saved will be in the exact minimal format, ready for memory-mapping.

    Then, when needed later, load as:

    glove_kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')
    

    The mmap argument uses underlying OS mechanisms to simply map the relevant matrix address-space to the (read-only) file(s) on disk, so that the initial 'load' is effectively instant, but any attempt to access ranges of the matrix will use virtual-memory to page-in the right ranges of the file. It thus eliminates any scanning-for-delimiters & defers IO until absolutely needed. (And if there are any ranges you never access? They'll never be loaded.)
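
    A rough way to see this in action (timings here are illustrative & machine-dependent):

    import time
    from gensim.models import KeyedVectors

    t0 = time.time()
    glove_kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')
    print(f"load: {time.time() - t0:.3f}s")   # near-instant; no vector data read yet

    t0 = time.time()
    _ = glove_kv['king']                      # first access pages in just that range
    print(f"first lookup: {time.time() - t0:.4f}s")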

    The other big benefit of memory-mapping is that if multiple processes each read-only memory-map the same on-disk files, the OS is smart enough to let them share any common paged-in ranges. So with, say, 3 totally-separate OS processes that each mmap the same file, you get 3X RAM savings.
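
    Applied to your Flask + Gunicorn setup, a minimal sketch might look like this (module & route names are illustrative, matching the wsgi:app in your ExecStart line):

    # wsgi.py – each of the 3 Gunicorn workers imports this module & mmaps
    # the same on-disk files, so paged-in ranges are shared rather than tripled
    from flask import Flask, jsonify
    from gensim.models import KeyedVectors

    app = Flask(__name__)
    glove_kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')  # near-instant

    @app.route('/vector/<word>')
    def vector(word):
        if word not in glove_kv:
            return jsonify(error='unknown word'), 404
        return jsonify(vector=glove_kv[word].tolist())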

    (If after all these changes, the lag upon restarting server processes is still an issue – perhaps because the server processes crash or otherwise need restarting often – you could even consider using some other long-lived, stable process to initially mmap the vectors. Then, even the crash of all server processes wouldn't cause the OS to lose any paged-in ranges of the file, and the restart of the server processes might find some or all of the relevant data already in RAM. But the complication of this extra role may be superfluous once the other optimizations are in place.)
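
    If you did want that extra warming role, it could be as simple as this illustrative sketch (not a hardened service):

    # keep_warm.py – a long-lived process holding a read-only mmap open, so
    # the OS keeps paged-in data cached even across server-worker restarts
    import time
    from gensim.models import KeyedVectors

    kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')
    _ = kv.vectors.sum()   # touch every page once, pulling the whole file into RAM
    while True:
        time.sleep(3600)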

    One extra caveat: if you start using KeyedVectors methods like .most_similar() that can (up through gensim-3.8.3) trigger the creation of a full-size cache of the unit-length-normalized word-vectors, you could lose the mmap benefits unless you take some extra steps to short-circuit that process. See more details in this prior answer: How to speed up Gensim Word2vec model load time?
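
    In outline, that short-circuit looks like this for gensim-3.x (a sketch – verify against the linked answer for your version):

    # before saving: normalize the vectors in place, once
    glove_kv.init_sims(replace=True)
    glove_kv.save('glove.6B.300d.gs')

    # ...later, after each mmap load: tell gensim the raw vectors are already
    # unit-normalized, so .most_similar() won't allocate a second full-size array
    glove_kv = KeyedVectors.load('glove.6B.300d.gs', mmap='r')
    glove_kv.vectors_norm = glove_kv.vectors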