Edit:
So sorry; it turns out I had other processes running on my GPU while I ran the test. I've re-run the timings on an idle GPU, and the speedup becomes noticeable for larger matrices.
Original Post:
As posted in this question, L is a list of matrices, where each item M is an x*n matrix (x is variable, n is fixed). I want to compute the sum of M'*M for all items in L (M' is the transpose of M), as the following Python code does.
for M in L:
    res += np.dot(M.T, M)
The following are some example Numpy and Theano implementations (for an executable script, please refer to @DanielRenshaw's answer to the previous question).
import numpy as np
import theano
import theano.tensor as tt
import theano.typed_list

def numpy_version1(*L):
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res
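(For reference, the whole reduction can also be collapsed into a single matrix product, since sum_i M_i'*M_i == K'*K, where K is the vertical concatenation of all the M_i. This variant is not part of the original benchmarks; it's just a minimal sketch of the algebraic shortcut:)

def numpy_version_concat(*L):
    # sum_i M_i.T * M_i == K.T * K where K stacks all M_i vertically,
    # so the whole Python loop collapses into one large matrix product.
    K = np.concatenate(L, axis=0)
    return np.dot(K.T, K)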
def compile_theano_version1(number_of_matrices, n, dtype):
    L = [tt.matrix() for _ in xrange(number_of_matrices)]
    res = tt.zeros((n, n), dtype=dtype)
    for M in L:
        res += tt.dot(M.T, M)
    return theano.function(L, res)
def compile_theano_version2(number_of_matrices, n):
    L = theano.typed_list.TypedListType(
        tt.TensorType(theano.config.floatX, broadcastable=(None, None)))()
    res, _ = theano.reduce(
        fn=lambda i, tmp: tmp + tt.dot(L[i].T, L[i]),
        outputs_info=tt.zeros((n, n), dtype=theano.config.floatX),
        sequences=[theano.tensor.arange(number_of_matrices, dtype='int64')])
    return theano.function([L], res)
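(The same concatenation shortcut sketched after numpy_version1 can be written in Theano, turning the whole reduction into a single large dot on the GPU; again a minimal sketch, not one of the benchmarked versions:)

def compile_theano_concat(number_of_matrices):
    L = [tt.matrix() for _ in xrange(number_of_matrices)]
    # Stack everything into one (sum_of_x, n) matrix, then do one n x n product.
    K = tt.concatenate(L, axis=0)
    return theano.function(L, tt.dot(K.T, K))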
I ran the Numpy version on the CPU and the Theano versions on the GPU with different settings; the Theano versions seem to be consistently, proportionally slower than the Numpy version (regardless of the number and size of the matrices). But I was expecting some GPU-related optimization, as this is a simple reduce operation. Could someone help me understand what's going on under the hood?
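(Aside: Theano's built-in profiler can print a per-op time breakdown, which helps with exactly this kind of question; a minimal sketch, where f, L, res and matrices are placeholders for the compiled function, the symbolic graph from the versions above, and the generated data:)

f = theano.function(L, res, profile=True)  # compile with profiling enabled
for _ in xrange(10):
    f(*matrices)                           # run a few times to gather stats
f.profile.summary()                        # print the per-op time breakdown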
Edit:
The following is the script (from @DanielRenshaw) for generating the data, plus the settings I've tried and the results.
dtype = 'float32'
theano.config.floatX = dtype
iteration_count = 10
min_x = 20

# Data generation (per test case, using the settings below):
L = [np.random.standard_normal(size=(x, n)).astype(dtype)
     for x in range(min_x, number_of_matrices + min_x)]

# base case:
number_of_matrices = 200
n = 100
# numpy_version1  0.100589990616
# theano_version1 0.243968963623
# theano_version2 0.198153018951

# increase matrix size:
number_of_matrices = 200
n = 1000
# numpy_version1  4.90120816231
# theano_version1 0.984472036362
# theano_version2 3.56008815765

# increase number of matrices:
number_of_matrices = 2000
n = 100
# numpy_version1  5.11445093155
# theano_version1 compilation error
# theano_version2 6.54448604584
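(The full benchmarking script is @DanielRenshaw's; as a rough sketch of how such per-function timings could be collected, assuming fn is one of the implementations above and args is the generated list of matrices:)

import timeit

def time_fn(fn, args, iteration_count=10):
    # Average wall-clock time over iteration_count calls.
    start = timeit.default_timer()
    for _ in xrange(iteration_count):
        fn(*args)
    return (timeit.default_timer() - start) / iteration_count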
The problem you are having is not the number of matrices but their size. Your test example creates matrices whose size depends on the number of matrices you have; thus, the more matrices you have, the larger they are, but the larger the Python loop overhead is too (in reduce operations), which makes it harder to detect speed improvements.
I've slightly modified your matrix generation in order to run some new tests:
S = 1000  # Size of the matrices
N = 10    # Number of matrices
L = [np.random.standard_normal(size=(np.random.randint(S//2, S*2), S)).astype(np.float32)
     for _ in range(N)]
This generates only 10 matrices of size (x, 1000), where x is some value in the range [S//2, S*2] == [500, 2000].
f1 = compile_theano_version1(N, S, np.float32)
f2 = compile_theano_version2(N, S)
N = 10, big matrices. For S = 1000, N = 10:
%timeit numpy_version1(*L) # 10 loops, best of 3: 131 ms per loop
%timeit f1(*L) # 10 loops, best of 3: 37.3 ms per loop
%timeit f2(L) # 10 loops, best of 3: 68.7 ms per loop
Here the Theano functions get roughly x4 and x2 speedups, on a laptop with a pretty nice i7 and a decent NVIDIA 860M (which means that you should get some nicer speedups on your hardware).
For S = 5000, N = 10:
%timeit numpy_version1(*L) # 1 loops, best of 3: 4 s per loop
%timeit f1(*L) # 1 loops, best of 3: 907 ms per loop
%timeit f2(L) # 1 loops, best of 3: 1.77 s per loop
So, overall, with this setup, the larger S is, the larger the speedup Theano gets over the CPU.
N = 100, big matrices: Theano seems faster. For S = 1000, N = 100:
%timeit numpy_version1(*L) # 1 loops, best of 3: 1.46 s per loop
%timeit f1(*L) # 1 loops, best of 3: 408 ms per loop
%timeit f2(L)              # 1 loops, best of 3: 724 ms per loop
For S = 2000, N = 100:
%timeit numpy_version1(*L) # 1 loops, best of 3: 11.3 s per loop
%timeit f1(*L) # 1 loops, best of 3: 2.72 s per loop
%timeit f2(L) # 1 loops, best of 3: 4.01 s per loop
N = 100, small matrices: numpy seems faster. For S = 50, N = 100:
%timeit numpy_version1(*L) # 100 loops, best of 3: 1.17 ms per loop
%timeit f1(*L) # 100 loops, best of 3: 4.21 ms per loop
%timeit f2(L) # 100 loops, best of 3: 7.42 ms per loop
Specifications for the tests: