I'd like to sort a matrix of shape (N, 2)
on the first column where N
>> system memory.
With in-memory numpy you can do:
x = np.array([[2, 10],[1, 20]])
sortix = x[:,0].argsort()
x = x[sortix]
But that appears to require that x[:,0].argsort()
fit in memory, which won't work for memmap where N
>> system memory (please correct me if this assumption is wrong).
Can I achieve this sort in-place with numpy memmap?
(assume heapsort is used for sorting and simple numeric data types are used)
The solution may be simple, using the order argument to an in place sort. Of course, order
requires fieldnames, so those have to be added first.
d = x.dtype
x = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
array([[(2, 10)],
[(1, 20)]], dtype=[('0', '<i8'), ('1', '<i8')])
The field names are strings, corresponding to the column indices. Sorting can be done in place with
x.sort(order='0', axis=0)
Then convert back to a regular array with the original datatype
x.view(d)
array([[ 1, 20],
[ 2, 10]])
That should work, although you may need to change how the view is taken depending on how the data is stored on disk, see the docs
For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.