Tags: python, numpy, numpy-memmap

Efficiency of random slicing on a numpy memory map


I have a 20 GB, 100k x 100k float16 2D array stored as a data file. I memory-map it as follows:

fp_read = np.memmap(filename, dtype='float16', mode='r', shape=(100000, 100000))

I then attempt to read slices from it. The vertical slices I need to take are effectively random, but performance for them is very poor. Am I doing something wrong?
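For anyone who wants to reproduce the timings below without a 20 GB file, a smaller memmap can be written the same way. The file name and shape here are assumptions chosen only to keep the example small (note that a file this small fits in the OS page cache, which mutes the disk effects discussed in the solution):

import numpy as np

# Hypothetical small stand-in: 5000 x 5000 float16 is ~50 MB, not 20 GB.
shape = (5000, 5000)
fp_write = np.memmap('test.dat', dtype='float16', mode='w+', shape=shape)
fp_write[:] = np.random.rand(*shape).astype('float16')
fp_write.flush()
del fp_write  # close the write handle before reopening read-only

fp_read = np.memmap('test.dat', dtype='float16', mode='r', shape=shape)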


Analysis:

I have compared this with other forms of cross-sectional slicing, which perform much better, although I don't know why they should:

%timeit fp_read[:,17000:17005]    # slice 5 consecutive cols
1.64 µs ± 16.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit fp_read[:,11000:11050:10]
1.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit fp_read[:,5000:6000:200]
1.66 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit fp_read[:,0:100000:20000]    # slice 5 disperse cols
1.69 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit fp_read[:,[1,1001,27009,81008,99100]]     # slice 5 rand cols
32.4 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = np.arange(100000); b = np.array([1,1001,27009,81008,99100])
%timeit fp_read[np.ix_(a,b)]
18 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Even these timeit measurements don't accurately capture the performance degradation, since:

import time
a = np.arange(100000)
cols = np.arange(100000)
np.random.shuffle(cols)
cols = np.sort(cols[:5])
t = time.time()
arr = fp_read[np.ix_(a,cols)]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took: 24.5 seconds

Compared with:

t = time.time()
arr = fp_read[:,0:100000:20000]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took: 0.00024 seconds

Solution

  • The performance difference is explained by one key difference between "basic slicing and indexing" and "advanced indexing"; see these docs. The key line is:

    Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

    How much the copy hurts can be seen by comparing fp_read[:,5000:6000:200] against fp_read[:,5000:6000:200].copy().
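    For example (a quick sketch; exact numbers depend on the machine and on what the OS has already cached):

        %timeit fp_read[:,5000:6000:200]          # basic slice: builds a view, touches no data
        %timeit fp_read[:,5000:6000:200].copy()   # forces the copy: the data must actually be read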

    Although making an array copy is always going to be slower than making a new view, it's especially bad for a memmap:

    1. Reading from disk is relatively slow. The data needs to be read from disk to make the (in-memory) copy, while a view doesn't have to read any data at all! A new ndarray object is simply created with a new offset and new strides into the same memory buffer.
    2. The memory layout of your data is row-major order (vs. column-major; see wikipedia). For accessing random columns this means that a sector has to be read from disk for every single value of data. Compare that to contiguous access, where you read only one sector for every 256 values (assuming float16 and 512-byte sectors). With memory-mapped I/O this effect is even worse, because data is then read in blocks (memory pages) of 4 kB, i.e. 8 x 512-byte sectors. Both points are illustrated in the sketch after this list.
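    A minimal sketch of both points, assuming the fp_read memmap from above (the expected values follow from the 100000 x 100000 float16, row-major layout):

        view = fp_read[:, 17000:17005]
        print(np.shares_memory(view, fp_read))  # True: the slice is a view into the same buffer
        print(view.strides)                     # (200000, 2): new offset/strides, no data read

        # Row-major layout: consecutive values of one column lie a full row apart on disk.
        row_bytes = 100000 * 2                  # one row of float16 values = 200 kB
        print(row_bytes // 4096)                # 48: roughly 48 memory pages between them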

    Now we can also understand why the timeit results are not really representative: that particular part of the file is cached by the OS in memory, so repeated runs are served from RAM rather than from disk.
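    One way to get more representative numbers, as a rough sketch, is to draw fresh random columns on every trial so that the requested pages are unlikely to be cached already (explicitly dropping the page cache is OS-specific and not shown here):

        import time
        import numpy as np

        # Each trial picks columns that were (probably) not touched before,
        # so most reads must go to disk instead of the OS page cache.
        for _ in range(3):
            cols = np.sort(np.random.choice(100000, size=5, replace=False))
            t = time.time()
            arr = fp_read[:, cols]
            print('{:.3f} seconds for columns {}'.format(time.time() - t, cols))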