I have a 20GB, 100k x 100k 'float16' 2D array as a datafile. I load it to memory as follows:
fp_read = np.memmap(filename, dtype='float16', mode='r', shape=(100000, 100000))
I then attempt to read slices from it. The vertical slices I need to take are effectively random but the performance is very poor for this, or am I doing something wrong?
Analysis:
I have compared with other forms of cross-sectional slicing, which is much better although I don't know why it should be:
%timeit fp_read[:,17000:17005] # slice 5 consecutive cols
1.64 µs ± 16.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,11000:11050:10]
1.67 µs ± 21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,5000:6000:200]
1.66 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,0:100000:20000] # slice 5 disperse cols
1.69 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit fp_read[:,[1,1001,27009,81008,99100]] # slice 5 rand cols
32.4 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = np.arange(100000); b = np.array([1,1001,27009,81008,99100])
%timeit fp_read[np.ix_(a,b)]
18 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Even these timeit functions don't accurately capture the performance degradation, since:
import time
a = np.arange(100000)
cols = np.arange(100000)
np.random.shuffle(cols)
cols = np.sort(cols[:5])
t = time.time()
arr = fp_read[np.ix_(a,cols)]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took: 24.5 seconds
Compared with:
t = time.time()
arr = fp_read[:,0:100000:20000]
print('Actually took: {} seconds'.format(time.time() - t))
Actually took 0.00024 seconds
The performance difference is explained by one key difference in "basic slicing and indexing" vs. "advanced indexing", see these docs. The key line herein is
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
How much the copy hurts can be seen from comparing fp_read[:,5000:6000:200]
against fp_read[:,5000:6000:200].copy()
.
Although making an array copy is always going to be slower than making a new view, it's especially bad for a memmap:
Now we can also understand why the timeit results are not really representative: That particular part of the file is cached by the OS in memory.