Tags: python, matrix, pytables, numpy-memmap

Assigning values to list slices of large dense square matrices (Python)


I'm dealing with large dense square matrices of size N x N (roughly 100k x 100k) that are too large to fit into memory.

After doing some research, I've found that most people handle large matrices by using either numpy's memmap or the pytables package. However, both seem to have major limitations: neither appears to support assigning values to list slices of the on-disk matrix along more than one dimension.

I'm looking for an efficient way to assign values to a large dense square matrix M with something like:

M[0, [1,2,3], [8,15,30]] = np.zeros((3, 3)) # or

M[0, [1,2,3,1,2,3,1,2,3], [8,8,8,15,15,15,30,30,30]] = 0 # for memmap

  • With memmap, the expression M[0, [1,2,3], [8,15,30]] always copies the slice into RAM, so assignment doesn't seem to work.
  • With pytables, list slicing along more than one dimension is not supported. Currently I'm slicing along one dimension and then along the other (i.e. M[0, [1,2,3]][:, [8,15,30]]). The RAM usage of this approach scales with N, which is better than dealing with the whole array (N^2) but is still not ideal.

  • In addition, it appears that pytables isn't the most efficient way of handling matrices with lots of rows (or could there be a way of specifying the chunksize to get rid of this message?). I am getting the following warning message:

The Leaf ``/M`` is exceeding the maximum recommended rowsize (104857600 bytes);
be ready to see PyTables asking for *lots* of memory and possibly slow
I/O.  You may want to reduce the rowsize by trimming the value of
dimensions that are orthogonal (and preferably close) to the *main*
dimension of this leave.  Alternatively, in case you have specified a
very small/large chunksize, you may want to increase/decrease it.

I'm just wondering whether there are better solutions for assigning values to arbitrary 2D slices of large matrices.


Solution

  • First of all, note that in numpy (not sure about pytables) M[0, [1,2,3], [8,15,30]] will return an array of shape (3,) corresponding to elements M[0,1,8], M[0,2,15] and M[0,3,30], so assigning np.zeros((3,3)) to that will raise an error.

    Now, the following works fine for me:

    import numpy as np

    np.save('M.npy', np.random.randn(5,5,5))  # create a small dummy array on disk
    M = np.load('M.npy', mmap_mode='r+')  # open the file as a writable memmap
    M[[0,1,2],[1,2,3],[2,3,4]] = 0  # assign through paired index lists
    M.flush()  # make sure the change is written to disk
    del M
    M = np.load('M.npy', mmap_mode='r+')  # re-open the file
    print(M[[0,1,2],[1,2,3],[2,3,4]])  # should show [0. 0. 0.]
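
    If what you actually want is the full 3x3 block (every listed row crossed with every listed column), one option is np.ix_. Here is a minimal sketch, assuming a 2-D matrix as described in the question. The file name M_demo.npy and the size n = 50 are made up for illustration, and np.lib.format.open_memmap is used only so the file is created without first building the array in RAM. I have not tried this at the 100k x 100k scale.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and a small size so the sketch runs quickly.
path = os.path.join(tempfile.mkdtemp(), "M_demo.npy")
n = 50

# open_memmap creates the .npy file on disk without building the array in RAM.
M = np.lib.format.open_memmap(path, mode="w+", dtype=np.float64, shape=(n, n))
M[:] = 1.0

rows = [1, 2, 3]
cols = [8, 15, 30]

# Paired index lists select only the elements (1, 8), (2, 15) and (3, 30).
paired_shape = M[rows, cols].shape

# np.ix_ builds an open mesh, so this assigns to the 3x3 block formed by
# every listed row crossed with every listed column.
M[np.ix_(rows, cols)] = np.zeros((3, 3))
M.flush()
del M

# Re-open the file read-only to check that the write reached the disk.
M = np.load(path, mmap_mode="r")
block = np.array(M[np.ix_(rows, cols)])
untouched = float(M[0, 0])
```

    Reading M[np.ix_(rows, cols)] gives an in-memory copy, as with any list indexing, but assigning to it writes into the memmap itself, so only the selected elements need to be touched.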