I am trying to multiply a sub-matrix by a sub-vector. It seems that such a multiplication should be faster than multiplying the whole matrix by the whole vector, but time measurements say the opposite:
import numpy as np

B = np.random.randn(26200, 2000)
h = np.random.randn(2000)
%time z = B @ h
CPU times: user 56 ms, sys: 4 ms, total: 60 ms
Wall time: 29.4 ms
%time z = B[:, :256] @ h[:256]
CPU times: user 44 ms, sys: 28 ms, total: 72 ms
Wall time: 54.5 ms
Results with %timeit:
%timeit z = B @ h
100 loops, best of 3: 18.8 ms per loop
%timeit z = B[:, :256] @ h[:256]
10 loops, best of 3: 38.2 ms per loop
Running it again:
%timeit z = B @ h
10 loops, best of 3: 18.7 ms per loop
%timeit z = B[:, :256] @ h[:256]
10 loops, best of 3: 36.8 ms per loop
Maybe there is some efficient way to do this with NumPy, or maybe I need to use, for example, TensorFlow to make this slicing efficient?
It's a problem of memory layout and access time. By default, arrays are stored row by row, as in C (order='C'). You can instead store your data column by column, as in Fortran (order='F'), which suits your restricted problem better, since you select only a few columns.
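As a quick sketch of why this matters (array names follow the question), you can inspect the `flags` and `strides` attributes: a column slice of a C-ordered array is a non-contiguous view, while the same slice of a Fortran-ordered array keeps its columns contiguous in memory.

```python
import numpy as np

B = np.random.randn(26200, 2000)   # C-ordered: rows are contiguous
BF = np.asfortranarray(B)          # F-ordered: columns are contiguous

# A column slice of the C-ordered array is a non-contiguous view:
# walking down one column of B[:, :256] steps over 2000 elements per row.
print(B[:, :256].flags['C_CONTIGUOUS'])   # False
print(BF[:, :256].flags['F_CONTIGUOUS'])  # True

# Strides (bytes to step along each axis) make this visible:
print(B[:, :256].strides)    # (16000, 8) -> large row stride
print(BF[:, :256].strides)   # (8, 209600) -> unit stride down a column
```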
Illustration:
In [107]: BF=np.asfortranarray(B)
In [108]: np.equal(B,BF).all()
Out[108]: True
In [110]: %timeit B@h
78.5 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [111]: %timeit BF@h
89.3 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [112]: %timeit B[:,:256]@h[:256]
150 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [113]: %timeit BF[:,:256]@h[:256]
10.5 ms ± 893 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This way, execution time scales with the problem size.
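If converting the whole matrix to Fortran order is not an option, a simpler workaround (a sketch, not benchmarked here) is to copy the needed column block once into a contiguous array and reuse it for repeated products:

```python
import numpy as np

B = np.random.randn(26200, 2000)
h = np.random.randn(2000)

# One-time copy: the 256-column block becomes a contiguous array
# that BLAS can stream through without large strides.
B_sub = np.ascontiguousarray(B[:, :256])

z = B_sub @ h[:256]

# Same result as multiplying through the strided view directly:
print(np.allclose(z, B[:, :256] @ h[:256]))  # True
```

The copy costs memory and one pass over the block, so it only pays off when the same sub-matrix is multiplied many times.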