Search code examples
pythonnumpypython-3.6memmap

Numpy memmap in-place sort of a large matrix by column


I'd like to sort a matrix of shape (N, 2) on the first column where N >> system memory.

With in-memory numpy you can do:

x = np.array([[2, 10],[1, 20]])
sortix = x[:,0].argsort()
x = x[sortix]

But that appears to require that x[:,0].argsort() fit in memory, which won't work for memmap where N >> system memory (please correct me if this assumption is wrong).

Can I achieve this sort in-place with numpy memmap?

(assume heapsort is used for sorting and simple numeric data types are used)


Solution

  • The solution may be simple, using the order argument to an in place sort. Of course, order requires fieldnames, so those have to be added first.

    d = x.dtype
    x = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
    array([[(2, 10)],
       [(1, 20)]], dtype=[('0', '<i8'), ('1', '<i8')])
    

    The field names are strings, corresponding to the column indices. Sorting can be done in place with

    x.sort(order='0', axis=0)
    

    Then convert back to a regular array with the original datatype

    x.view(d)
    array([[ 1, 20],
       [ 2, 10]])
    

    That should work, although you may need to change how the view is taken depending on how the data is stored on disk, see the docs

    For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.