
Speedup for the reduce operation in Theano


Edit:
So sorry, it turns out that I had other processes running on my GPU while I did the test, I've updated the timing results on an idle GPU, and the speedup becomes noticeable for larger matrices.

Original Post:

As posted in this question, L is a list of matrices, where each item M is an x*n matrix (x varies from matrix to matrix; n is fixed).

I want to compute the sum of M'*M over all items M in L (M' is the transpose of M), as the following Python code does:

res = np.zeros((n, n), dtype=L[0].dtype)  # accumulator for the n x n result
for M in L:
    res += np.dot(M.T, M)

The following are example NumPy and Theano implementations (for the full executable script, please refer to @DanielRenshaw's answer to the previous question).

def numpy_version1(*L):
    n = L[0].shape[1]
    res = np.zeros((n, n), dtype=L[0].dtype)
    for M in L:
        res += np.dot(M.T, M)
    return res

def compile_theano_version1(number_of_matrices, n, dtype):
    L = [tt.matrix() for _ in xrange(number_of_matrices)]
    res = tt.zeros((n, n), dtype=dtype)  # n x n accumulator, matching the NumPy version
    for M in L:
        res += tt.dot(M.T, M)
    return theano.function(L, res)

def compile_theano_version2(number_of_matrices, n):
    L = theano.typed_list.TypedListType(
        tt.TensorType(theano.config.floatX, broadcastable=(None, None)))()
    res, _ = theano.reduce(
        fn=lambda i, tmp: tmp + tt.dot(L[i].T, L[i]),
        outputs_info=tt.zeros((n, n), dtype=theano.config.floatX),
        sequences=[tt.arange(number_of_matrices, dtype='int64')])
    return theano.function([L], res)
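
Note that the two compiled functions have different calling conventions: version 1 takes the matrices as separate positional arguments, while the typed-list version takes the Python list itself. A minimal usage sketch (the data here is illustrative, not the benchmark data):

import numpy as np

n = 100
L = [np.random.standard_normal(size=(20 + i, n)).astype('float32')
     for i in range(5)]

f1 = compile_theano_version1(len(L), n, 'float32')
f2 = compile_theano_version2(len(L), n)

res1 = f1(*L)  # matrices unpacked as separate arguments
res2 = f2(L)   # the list passed as a single typed-list argument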

I ran the NumPy version on the CPU and the Theano versions on the GPU with different settings, and it seems that the Theano versions are always proportionally slower than the NumPy version (regardless of the number and size of the matrices).

But I was expecting some GPU speedup, since this is a simple reduce operation.
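
For intuition about why a GPU should help: the reduction is mathematically equivalent to one large matrix product, since stacking the matrices row-wise into C gives sum(M' * M) == C' * C, i.e. a single big GEMM. A quick NumPy check of that identity (a standalone sketch, not part of the benchmark):

import numpy as np

n = 100
L = [np.random.standard_normal(size=(x, n)).astype('float32')
     for x in (30, 50, 70)]

res_loop = np.zeros((n, n), dtype='float32')
for M in L:
    res_loop += np.dot(M.T, M)

C = np.vstack(L)             # shape (30 + 50 + 70, n)
res_single = np.dot(C.T, C)  # same (n, n) result in one product

assert np.allclose(res_loop, res_single, atol=1e-2)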

Could someone help me understand what's going on under the hood?

Edit:
The following are the data-generation script (from @DanielRenshaw), the settings I've tried, and the results.

# note: x grows with the matrix index, so a larger number_of_matrices
# also means larger matrices on average (size and count are coupled)
L = [np.random.standard_normal(size=(x, n)).astype(dtype)
     for x in range(min_x, number_of_matrices + min_x)]

dtype = 'float32'
theano.config.floatX = dtype
iteration_count = 10
min_x = 20

# base case:
# numpy_version1 0.100589990616
# theano_version1 0.243968963623
# theano_version2 0.198153018951
number_of_matrices = 200
n = 100

# increase matrix size:
# numpy_version1 4.90120816231
# theano_version1 0.984472036362
# theano_version2 3.56008815765
number_of_matrices = 200
n = 1000

# increase number of matrices:
# numpy_version1 5.11445093155
# theano_version1 compilation error
# theano_version2 6.54448604584
number_of_matrices = 2000
n = 100
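
For completeness, here is a minimal sketch of the kind of timing harness that could produce numbers like those above (the actual script is in @DanielRenshaw's answer; the benchmark helper and the time.time() loop below are my assumptions, not his code):

import time

def benchmark(name, fn, args, iteration_count=10):
    fn(*args)  # warm-up call so one-off compilation/transfer cost is excluded
    start = time.time()
    for _ in range(iteration_count):
        fn(*args)
    print('%s %s' % (name, (time.time() - start) / iteration_count))

# assumes L, numpy_version1, and the compiled functions f1/f2 already exist
benchmark('numpy_version1', numpy_version1, L)
benchmark('theano_version1', f1, L)    # f1 takes matrices as separate arguments
benchmark('theano_version2', f2, [L])  # f2 takes the whole list as one argument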

Solution

  • The problem you are having is not the number of matrices but their size.

    Your test example creates matrices whose size depends on how many matrices you have, so the more matrices there are, the larger each one is; at the same time the Python loop overhead (in the reduce operations) also grows, which makes it harder to detect speed improvements.

    I've slightly modified your matrix generation in order to run some new tests:

    S = 1000 # Size of the matrices
    N = 10 # Number of matrices
    
    L = [np.random.standard_normal(size=(np.random.randint(S//2, S*2), S)).astype(np.float32)
         for _ in range(N)]
    

    This generates only 10 matrices of size (x, 1000), where x is drawn from the range [S//2, S*2) == [500, 2000).

    f1 = compile_theano_version1(N, S, np.float32)
    f2 = compile_theano_version2(N, S)
    

    • Now some tests with N = 10 big matrices:

    For S = 1000, N = 10:

     %timeit numpy_version1(*L)   # 10 loops, best of 3: 131 ms per loop
     %timeit f1(*L)               # 10 loops, best of 3: 37.3 ms per loop
     %timeit f2(L)                # 10 loops, best of 3: 68.7 ms per loop
    

    Here the Theano functions show roughly 4x and 2x speedups on a laptop with a pretty nice i7 and a decent NVIDIA 860M (which means you should see even nicer speedups on more powerful hardware).

    For S = 5000, N = 10:

     %timeit numpy_version1(*L)   # 1 loops, best of 3: 4 s per loop
     %timeit f1(*L)               # 1 loops, best of 3: 907 ms per loop
     %timeit f2(L)                # 1 loops, best of 3: 1.77 s per loop
    

    So, overall, with this setup, the larger S is, the larger the speedup Theano gets over the CPU.


    • Some tests with N = 100 big matrices: Theano seems faster

    For S = 1000, N = 100:

    %timeit numpy_version1(*L)   # 1 loops, best of 3: 1.46 s per loop
    %timeit f1(*L)               # 1 loops, best of 3: 408 ms per loop
    %timeit f2(L)                # 1 loops, best of 3: 724 ms per loop
    

    For S = 2000, N = 100:

    %timeit numpy_version1(*L)   # 1 loops, best of 3: 11.3 s per loop
    %timeit f1(*L)               # 1 loops, best of 3: 2.72 s per loop
    %timeit f2(L)                # 1 loops, best of 3: 4.01 s per loop
    

    • Tests with N = 100 small matrices: NumPy seems faster. For matrices this small, the per-call GPU transfer and kernel-launch overhead outweighs the actual computation.

    For S = 50, N = 100:

    %timeit numpy_version1(*L)   # 100 loops, best of 3: 1.17 ms per loop
    %timeit f1(*L)               # 100 loops, best of 3: 4.21 ms per loop
    %timeit f2(L)                # 100 loops, best of 3: 7.42 ms per loop
    

    Specifications for the tests:

    • Processor: i7 4710HQ
    • GPU: NVIDIA GeForce GTX 860M
    • Numpy: version 1.10.2 built with Intel MKL
    • Theano: version 0.7.0; floatX = float32; using GPU