
Sharing memory for gensim's KeyedVectors objects between docker containers


Following the solution from a related question, I created a Docker container which loads the GoogleNews-vectors-negative300 KeyedVectors model and reads it fully into memory:

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load(model_path, mmap='r')
word_vectors.most_similar('stuff')  # touch the vectors so the mmapped pages are read into memory

I also have another Docker container which provides a REST API and loads this model with

KeyedVectors.load(model_path, mmap='r')

I observe that the fully loaded container takes more than 5 GB of memory and each gunicorn worker takes 1.7 GB of memory.

CONTAINER ID        NAME                        CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
acbfd080ab50        vectorizer_model_loader_1   0.00%               5.141GiB / 15.55GiB   33.07%              24.9kB / 0B         32.9MB / 0B         15
1a9ad3dfdb8d        vectorizer_vectorizer_1     0.94%               1.771GiB / 15.55GiB   11.39%              26.6kB / 0B         277MB / 0B          17

However, I expected all these processes to share the same memory for the KeyedVectors data, so that together they would only take about 5.4 GB, shared between all containers.

Has anyone tried to achieve this and succeeded?
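
One way to check how much of each container's RSS is actually shared, file-backed memory is to read the per-process counters in /proc/&lt;pid&gt;/smaps_rollup from inside the container. This is a minimal sketch, not part of the original setup, and assumes a Linux kernel (4.14+) that exposes smaps_rollup; read-only file pages mapped by more than one process are counted under Shared_Clean, while ordinary Python objects end up in Private_Dirty.

# Sketch: break down this process's resident memory (Linux, kernel >= 4.14).
def memory_breakdown(pid="self"):
    fields = {}
    with open("/proc/%s/smaps_rollup" % pid) as f:
        next(f)  # first line is the address-range header, skip it
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # values are in kB
    return fields

stats = memory_breakdown()
print("Rss:           %8d kB" % stats["Rss"])
print("Shared_Clean:  %8d kB" % stats["Shared_Clean"])   # e.g. read-only mmapped model pages
print("Private_Dirty: %8d kB" % stats["Private_Dirty"])  # per-process Python objects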

Edit: I tried the following code snippet and it does indeed share the same memory across different containers.

import mmap
from threading import Semaphore

with open("data/GoogleNews-vectors-negative300.bin", "rb") as f:
    # memory-map the file, size 0 means whole file
    fileno = f.fileno()
    mm = mmap.mmap(fileno, 0, access=mmap.ACCESS_READ)
    # read the whole file so every page is faulted into the (shared) page cache
    mm.read()
    # block forever so the process keeps the mapping alive
    Semaphore(0).acquire()
    # close the map
    mm.close()

So the problem is that KeyedVectors.load(model_path, mmap='r') does not share memory.

Edit 2: Studying gensim's source code, I see that np.load(subname(fname, attrib), mmap_mode=mmap) is called to open the memory-mapped file. The following code sample shares memory across multiple containers:

from threading import Semaphore

import numpy as np

data = np.load('data/native_format.bin.vectors.npy', mmap_mode='r')
print(data.shape)
# touch every element so the whole file is read into the (shared) page cache
print(data.mean())
# block forever so the mapping stays alive
Semaphore(0).acquire()
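
For reference, np.load with mmap_mode='r' returns a read-only numpy.memmap, so the data stays backed by the file and lives in the kernel page cache, which is what containers on the same host share when they map the same file. A small sanity check (a sketch, not part of the original post):

import numpy as np

# Sanity check (sketch): the loaded array is a file-backed, read-only memmap,
# so its pages live in the kernel page cache rather than in a private copy.
data = np.load('data/native_format.bin.vectors.npy', mmap_mode='r')
assert isinstance(data, np.memmap)
print("writeable:", data.flags['WRITEABLE'])   # False for mmap_mode='r'
print("backing file:", data.filename)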

Solution

  • After extensive debugging, I figured out that mmap works as expected for the numpy arrays in the KeyedVectors object.

    However, KeyedVectors has other attributes, such as self.vocab, self.index2word and self.index2entity, which are not shared and consume ~1.7 GB of memory for each object; a rough way to check this is sketched below.
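
    To see where the non-shared ~1.7 GB goes, the per-attribute size can be measured. A minimal sketch, assuming the third-party pympler package is installed, that model_path points at the same saved model as above, and that the model exposes the vocab / index2word attributes (names vary across gensim versions):

    from gensim.models import KeyedVectors
    from pympler import asizeof  # third-party; pip install pympler

    wv = KeyedVectors.load(model_path, mmap='r')

    # The vectors array is mmapped and shared; these Python containers are not.
    for attr in ('vocab', 'index2word', 'index2entity'):
        obj = getattr(wv, attr, None)
        if obj is not None:
            print("%-14s ~%.2f GB" % (attr, asizeof.asizeof(obj) / 2**30))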