According to the docs and Wikipedia:
mmap allows processes to share the same chunk of RAM
word_vectors = KeyedVectors.load(config.get(wv_file))
Loaded like this, the model takes ~2.2 GB of RAM
word_vectors = KeyedVectors.load(config.get(wv_file), mmap='r')
Loaded like this, the model takes ~1.2 GB of RAM
Why am I observing such a drastic decrease in RAM consumption?
Loading multiple models simultaneously works as expected: the models share the ~1 GB of memory.
Memory-mapping re-uses the operating system's virtual-memory functionality to use the existing file as the backing store for a range of addressable memory.
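For example: gensim's KeyedVectors keeps its vectors in a plain NumPy array, and the mmap='r' flag is (to my understanding) forwarded to numpy.load(..., mmap_mode='r'). The effect can be sketched with NumPy alone — the file name and sizes here are invented for illustration:

```python
import os
import tempfile
import numpy as np

# Sketch of what mmap='r' does under the hood: the array file becomes the
# backing store for a range of addresses instead of being copied into RAM.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.random.rand(5_000, 300).astype(np.float32))

eager = np.load(path)                # entire array copied into RAM
lazy = np.load(path, mmap_mode='r')  # pages stay on disk until touched

print(type(lazy).__name__)             # memmap
print(bool(np.allclose(eager, lazy)))  # True: same data either way
```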
With a single process, it won't necessarily save any memory. Instead, it just:
Delays loading any range of the addresses into RAM, leaving the data on disk until requested. If a range is never requested, RAM is never used for it, so in that particular case it may "save" memory.
Allows those loaded ranges to be cheaply discarded if they're not accessed for a while and the RAM is required for other allocations – because those ranges can be reloaded on demand from disk if ever needed again. So it might "save" memory in that case, compared to exhausting RAM or activating other generic virtual-memory that's not aware of the 1:1 relationship with an existing disk file. (Without memory-mapping, seldom-used ranges of material in RAM could get written out to a separate swap file to free space for other allocations – a wasteful and redundant operation when the same data already exists on disk somewhere.)
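The delayed-loading point can be observed directly: opening a memory-mapped array is nearly instant regardless of file size, because nothing is read until rows are actually indexed. A minimal NumPy sketch (sizes invented):

```python
import os
import tempfile
import numpy as np

# Delayed loading in action: np.load with mmap_mode='r' maps the file
# without reading it; only the pages backing vecs[42] are faulted into
# RAM when that row is indexed.
path = os.path.join(tempfile.mkdtemp(), "big.npy")
np.save(path, np.zeros((100_000, 300), dtype=np.float32))  # ~114 MB on disk

vecs = np.load(path, mmap_mode='r')  # near-instant, no bulk read
row = np.array(vecs[42])             # only this row's pages get loaded
print(row.shape)                     # (300,)
```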
Unfortunately, in the common case of a single process and typical operations like most_similar() – which necessarily computes against every single vector – the whole structure will be brought into memory on each most_similar() call. There's no net RAM "savings" there (though perhaps a slight CPU/IO benefit, if other memory pressure would otherwise have forced paging out the loaded ranges). (Also, whatever approach you're using to sample the "~2.2 GB" and "~1.2 GB" used-RAM values may not be measuring this properly.)
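To see why, here's a rough sketch of the kind of computation a most_similar() query performs – NOT gensim's actual implementation, just the shape of the work – a similarity score against every row, so every memory-mapped page of the array must be read in:

```python
import numpy as np

# Rough sketch of a most_similar()-style query (not gensim's real code):
# cosine similarity against every vector, which forces every page of a
# memory-mapped vectors array to be pulled from disk into RAM.
def most_similar(vectors, query_index, topn=3):
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[query_index]  # touches every single vector
    order = np.argsort(-sims)
    return [int(i) for i in order if i != query_index][:topn]

vecs = np.random.rand(1_000, 50).astype(np.float32)
neighbors = most_similar(vecs, 0)
print(len(neighbors))  # 3
```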
The main benefit appears when multiple processes each need to consult the same file's data. If each naively loads the data into RAM, each process holds its own redundant copy. With memory-mapping, you've let the OS know that these multiple arrays-in-address-space, in multiple separate processes, definitionally have the same data (as reflected in the file). No matter how many processes need the data, only one copy of each file-range will ever consume RAM, so a large savings can be achieved.