python, numpy, numpy-memmap

Numpy's memmap acting strangely?


I am dealing with large numpy arrays and I am trying out memmap as it could help.

big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))

The above works fine and creates a file of about 140GB on my hard drive. 1000000 is just a random number I used - not the one I am actually using.

I want to fill the matrix with values. Currently it is just set to zero.

for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i) 
    big_matrix[i, :] = modified_row

At this point now, I have a big_matrix filled with the values I want. The problem is that from this point on I can't operate on this memmap.

For example, I want to multiply column-wise (broadcast).

I run this:

big_matrix * weights[:, np.newaxis]

Here weights has the same length as big_matrix.

It just hangs and throws an out-of-memory error as my RAM and swap are all used up. My understanding was that the memmap would keep everything on the hard drive, for example saving the results directly there.

So I tried this then:

for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights

The above loads only 1 column into memory and multiplies it by the weights. Then I would save that column back into big_matrix.

But even with 1 column my program hangs. The only difference here is that the RAM is not used up.

At this point I am thinking of switching to sqlite.

I wanted to get some insight into why my code is not working. Do I need to flush the memmap every time I change it?


Solution

  • np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS. It works that way on Windows. On Linux, you can use madvise to tune this behaviour, but madvise is a low-level C function not yet supported by Numpy (though it is apparently supported in Python; see this issue for more information). In fact, Numpy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
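
    To illustrate the manual flush mentioned above, here is a minimal sketch (the file name and shape are made up for the example):

    import numpy as np

    # Made-up path and shape, just for illustration.
    mm = np.memmap("big_matrix.dat", dtype=np.float16, mode="w+", shape=(1000, 1000))
    mm[0, :] = 1.0   # modified pages may sit in the page cache for a while
    mm.flush()       # force dirty pages to be written back to the file
    del mm           # there is no explicit close(); the mapping is only released
                     # once the object is garbage collected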

    big_matrix * weights[:, np.newaxis] It just hangs and throws an out-of-memory error as my RAM and swap are all used up

    This is normal, since Numpy creates a new temporary array stored in RAM. There is no way to tell Numpy to store temporary arrays on the storage device. That being said, you can tell Numpy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can be created with memmap so as not to use too much memory (depending on the behaviour of the OS).
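
    As a sketch of the out approach (the file names and the size n are assumptions, not the actual ones from the question):

    import numpy as np

    n = 100_000                                             # assumed size for the example
    big_matrix = np.memmap("big_matrix.dat", dtype=np.float16, mode="r", shape=(n, n))
    weights = np.load("weights.npy").astype(np.float16)     # shape (n,)
    result = np.memmap("result.dat", dtype=np.float16, mode="w+", shape=(n, n))

    # The ufunc writes directly into the memmapped output instead of allocating
    # a large temporary array in RAM; paging the result out is left to the OS.
    np.multiply(big_matrix, weights[:, np.newaxis], out=result)
    result.flush()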

    But even with 1 column my program hangs. The only difference here is that the RAM is not used up.

    This is also expected, especially if you use an HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device, so big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS performs an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a full page (typically 4096 bytes, that is 2048 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second: HDDs can typically do about 20-200 IO requests per second, while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second.

    Note that the cache helps avoid reloading data for the next column, unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000, 1_000_000) this way causes up to 1_000_000*1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient since the processor cannot do that (due to the limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item read. Because a kernel call typically takes (at least) about ~1 us on a mainstream PC, this means more than a week for the whole computation.
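
    A rough way to see the cost of strided column access versus contiguous row access (a sketch, assuming big_matrix.dat already exists on disk and is not already sitting in the page cache):

    import time
    import numpy as np

    n = 50_000   # assumed size for the example
    mm = np.memmap("big_matrix.dat", dtype=np.float16, mode="r", shape=(n, n))

    t0 = time.perf_counter()
    row = np.asarray(mm[0, :])   # one contiguous read of n items
    t1 = time.perf_counter()
    col = np.asarray(mm[:, 0])   # n strided reads, roughly one page each
    t2 = time.perf_counter()
    print(f"row:    {t1 - t0:.3f} s")
    print(f"column: {t2 - t1:.3f} s")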

    If you want to read columns efficiently, then you need to read a large chunk of columns at once. For example, you certainly need at least several hundred columns per read, even on a fast NVMe SSD. For an HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read full columns efficiently, due to the high amount of RAM required. Using another data layout (tiles + transposed data) is critical in this case.
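
    For the specific broadcast in the question, the weights apply per row, so the column-access problem can be sidestepped entirely by processing contiguous blocks of rows. A sketch (file names, sizes and chunk size are assumptions):

    import numpy as np

    n = 1_000_000
    big_matrix = np.memmap("big_matrix.dat", dtype=np.float16, mode="r+", shape=(n, n))
    weights = np.load("weights.npy").astype(np.float16)   # shape (n,), one weight per row

    chunk = 256   # rows per block (~0.5 GB per block at this width)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        block = np.asarray(big_matrix[start:stop])         # one large sequential read
        block *= weights[start:stop, np.newaxis]           # per-row scaling, done in RAM
        big_matrix[start:stop] = block                     # one large sequential write
    big_matrix.flush()

    Each pass touches only a few hundred megabytes of RAM, and both the reads and the writes are sequential, which is what HDDs and SSDs handle well.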