I am trying to use vectorization (OpenMP SIMD) to speed up matrix multiplication. To make use of vectorization, I transpose the second matrix, so that the fastest-changing index runs over contiguous memory. I'm running my tests on 3000 x 3000 arrays. Because I wasn't able to measure any difference in wall time with vs. without the OpenMP pragma, I wanted to confirm that I actually get a speedup for the individual rows I am multiplying, which wasn't the case. To narrow it down, I inserted some dummy arrays of the same size and checked whether I get a SIMD speedup on them, which I did (at least when using the reduction clause).
Now my question is: what in my code is hindering the SIMD speedup? My only guess is that it must be the two-dimensionality of the array, but I don't fully understand why that would cause a slowdown.
Or is there another problem with my code that I don't see?
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <time.h>
const int N = 3000;
struct timespec begin, end;
double **create_a() {
    // one contiguous 32-byte-aligned block, plus an array of row pointers
    double *a = (double *)aligned_alloc(32, sizeof(double) * N * N);
    double **a_indexed = (double **)aligned_alloc(32, sizeof(double *) * N);
    for (int i = 0; i < N; i++) {
        a_indexed[i] = a + i * N;
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a_indexed[i][j] = i * j;
        }
    }
    return a_indexed;
}
double **create_b() {
    double *b = (double *)aligned_alloc(32, sizeof(double) * N * N);
    double **b_indexed = (double **)aligned_alloc(32, sizeof(double *) * N);
    for (int i = 0; i < N; i++) {
        b_indexed[i] = b + i * N;
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            b_indexed[i][j] = (i == j) ? 1 : 0;
        }
    }
    return b_indexed;
}
double **transpose(double **matrix) {
    double *t = (double *)aligned_alloc(32, sizeof(double) * N * N);
    double **t_indexed = (double **)aligned_alloc(32, sizeof(double *) * N);
    for (int i = 0; i < N; i++) {
        t_indexed[i] = t + i * N;
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            t_indexed[i][j] = matrix[j][i];
        }
    }
    return t_indexed;
}
double **mul(double **a, double **b) {
    double **b_t = transpose(b);
    double *res = (double *)aligned_alloc(32, sizeof(double) * N * N);
    double **res_indexed = (double **)aligned_alloc(32, sizeof(double *) * N);
    for (int i = 0; i < N; i++) {
        res_indexed[i] = res + i * N;
    }
    // note: both loops deliberately stop at 1, so only the dot product
    // for cell (0,0) is computed and timed
    for (int row = 0; row < 1; row++) {
        for (int col = 0; col < 1; col++) {
            double cell_res = 0;
            // problematic calculation that I can't get to speed up no matter what pragma I use
            clock_gettime(CLOCK_REALTIME, &begin);
            #pragma omp simd aligned(a, b_t:32) reduction(+:cell_res)
            for (int i = 0; i < N; i++) {
                cell_res += a[0][i] * b_t[0][i];
            }
            clock_gettime(CLOCK_REALTIME, &end);
            long seconds = end.tv_sec - begin.tv_sec;
            long nanoseconds = end.tv_nsec - begin.tv_nsec;
            double elapsed = seconds + nanoseconds * 1e-9;
            printf("Time simd reduce measured: %.9f seconds.\n", elapsed);
            // dummy array measurements
            struct timespec begin2, end2;
            struct timespec begin3, end3;
            struct timespec begin4, end4;
            double *a2 = (double *)aligned_alloc(32, sizeof(double) * N);
            double *b2 = (double *)aligned_alloc(32, sizeof(double) * N);
            for (int i = 0; i < N; i++) {
                a2[i] = 1;
                b2[i] = 1;
            }
            // measurement with reduction is significantly faster than others
            clock_gettime(CLOCK_REALTIME, &begin2);
            #pragma omp simd aligned(a2, b2:32) reduction(+:cell_res)
            for (int i = 0; i < N; i++) {
                cell_res += a2[i] * b2[i];
            }
            clock_gettime(CLOCK_REALTIME, &end2);
            long seconds2 = end2.tv_sec - begin2.tv_sec;
            long nanoseconds2 = end2.tv_nsec - begin2.tv_nsec;
            double elapsed2 = seconds2 + nanoseconds2 * 1e-9;
            printf("time2 (simd reduction): %.9f seconds.\n", elapsed2);
            // no speedup compared to without simd (slightly faster than problematic calculation)
            clock_gettime(CLOCK_REALTIME, &begin3);
            #pragma omp simd aligned(a2, b2:32)
            for (int i = 0; i < N; i++) {
                cell_res += a2[i] * b2[i];
            }
            clock_gettime(CLOCK_REALTIME, &end3);
            long seconds3 = end3.tv_sec - begin3.tv_sec;
            long nanoseconds3 = end3.tv_nsec - begin3.tv_nsec;
            double elapsed3 = seconds3 + nanoseconds3 * 1e-9;
            printf("time3 (simd): %.9f seconds.\n", elapsed3);
            // no pragma (slightly faster than problematic calculation)
            clock_gettime(CLOCK_REALTIME, &begin4);
            for (int i = 0; i < N; i++) {
                cell_res += a2[i] * b2[i];
            }
            clock_gettime(CLOCK_REALTIME, &end4);
            long seconds4 = end4.tv_sec - begin4.tv_sec;
            long nanoseconds4 = end4.tv_nsec - begin4.tv_nsec;
            double elapsed4 = seconds4 + nanoseconds4 * 1e-9;
            printf("time4: %.9f seconds.\n", elapsed4);
            res_indexed[0][0] = cell_res;
        }
    }
    return res_indexed;
}
int main(int argc, char **argv) {
    // init a(i,j) = i * j
    double **a = create_a();
    // init b(i,j) = (i == j) ? 1 : 0
    double **b = create_b();
    // multiply
    double **res = mul(a, b);
    return 0;
}
Time simd reduce measured: 0.000004188 seconds. // problematic
time2 (simd reduction): 0.000001762 seconds. // faster
time3 (simd): 0.000003475 seconds. //slightly faster
time4: 0.000003476 seconds. //slightly faster
I have tested the first two loops on my machine, and I could reproduce the same behavior.
Time simd reduce measured: 0.000006000 seconds.
time2 (simd reduction): 0.000004000 seconds.
My guess is that there are two issues:
First Problem:
The difference among the execution times of the multiple versions seems to be related to caching rather than vectorization. So when you tested with the dummy arrays of 3000 elements (24 kilobytes each):
double *a2 = (double *)aligned_alloc(32, sizeof(double) * N);
double *b2 = (double *)aligned_alloc(32, sizeof(double) * N);
for (int i = 0; i < N; i++) {
    a2[i] = 1;
    b2[i] = 1;
}
Those small arrays were loaded into the cache during their initialization (i.e., a2[i] = 1). The matrices a and b_t, on the other hand, have a size of 3000 x 3000 (72 megabytes each), which makes it unlikely for the first row to still be fully in the cache by the time it is read (although this depends on the cache sizes of your testing environment).
I changed the following loop:
// problematic calculation that I can't get to speed up no matter what pragma I use
clock_gettime(CLOCK_REALTIME, &begin);
#pragma omp simd aligned(a, b_t:32) reduction(+:cell_res)
for (int i = 0; i < N; i++) {
    cell_res += a[0][i] * b_t[0][i];
}
clock_gettime(CLOCK_REALTIME, &end);
by packing the first rows of the matrices a and b_t, respectively, into two new one-row matrices a_2 and bt_2 (allocated with aligned_alloc just like a), namely:
for (int i = 0; i < N; i++) {
    a_2[0][i] = a[0][i];
    bt_2[0][i] = b_t[0][i];
}
// the same calculation, now reading from the packed one-row matrices
clock_gettime(CLOCK_REALTIME, &begin);
#pragma omp simd aligned(a_2, bt_2:32) reduction(+:cell_res)
for (int i = 0; i < N; i++) {
    cell_res += a_2[0][i] * bt_2[0][i];
}
clock_gettime(CLOCK_REALTIME, &end);
By packing those rows into smaller matrices with only one row, I load them into the cache and consequently reduce the execution time of the first version. The new results:
Time simd reduce measured: 0.000004000 seconds.
time2 (simd reduction): 0.000004000 seconds.
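An alternative check of the caching hypothesis is to warm the rows instead of packing them. The following is a sketch of my own (the warm-up loop and the sink variable are not part of the original code); it touches both rows once before the timed region, so the measurement should again run on cached data:
volatile double sink = 0;
for (int i = 0; i < N; i++) {
    sink += a[0][i] + b_t[0][i]; // bring the cache lines of both rows in
}
clock_gettime(CLOCK_REALTIME, &begin);
#pragma omp simd aligned(a, b_t:32) reduction(+:cell_res)
for (int i = 0; i < N; i++) {
    cell_res += a[0][i] * b_t[0][i];
}
clock_gettime(CLOCK_REALTIME, &end);
The volatile sink keeps the compiler from removing the warm-up loop as dead code.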
IMO you should not have tested all those loops in the same function, because the compiler might optimize each loop differently, the earlier loops leave values in the cache for the later ones, and so on. I would have tested them in separate runs.
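For example, a standalone harness along the following lines (a sketch under my own naming; variant and its 0/1 encoding are assumptions, not your code) runs exactly one loop per process, so neither the optimizer nor a warmed cache can couple the measurements; making N a command-line argument as well would let you sweep input sizes:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

enum { N = 3000 };

int main(int argc, char **argv) {
    // pick exactly one variant per run: 0 = plain loop, 1 = simd + reduction
    int variant = (argc > 1) ? atoi(argv[1]) : 0;
    double *a2 = (double *)aligned_alloc(32, sizeof(double) * N);
    double *b2 = (double *)aligned_alloc(32, sizeof(double) * N);
    for (int i = 0; i < N; i++) {
        a2[i] = 1;
        b2[i] = 1;
    }
    double cell_res = 0;
    struct timespec begin, end;
    clock_gettime(CLOCK_REALTIME, &begin);
    if (variant == 1) {
        #pragma omp simd aligned(a2, b2:32) reduction(+:cell_res)
        for (int i = 0; i < N; i++) {
            cell_res += a2[i] * b2[i];
        }
    } else {
        for (int i = 0; i < N; i++) {
            cell_res += a2[i] * b2[i];
        }
    }
    clock_gettime(CLOCK_REALTIME, &end);
    // print the result so the compiler cannot drop the loop entirely
    printf("variant %d: %.9f seconds (res = %f)\n", variant,
           (end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec) * 1e-9,
           cell_res);
    free(a2);
    free(b2);
    return 0;
}
Compile it once (e.g., gcc -O2 -fopenmp-simd harness.c) and keep the flags identical across runs, so that only the selected loop changes.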
Regarding your question of whether the two-dimensionality of the array is what hinders the SIMD speedup: I also tested packing the first rows of the matrices a and b_t directly into separate 1D arrays (instead of one-row matrices), but the results were exactly the same, so the two-dimensionality by itself does not seem to be the cause. Take it for what it is worth. Now you should profile in your environment, namely testing the cache misses.
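On Linux, for example, perf gives a first impression of the cache behavior of each separate run (assuming the harness above was built as ./harness; these are the generic event names, your CPU may expose more precise counters):
perf stat -e cache-references,cache-misses ./harness 1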
More importantly, test this version:
clock_gettime(CLOCK_REALTIME, &begin);
for (int i = 0; i < N; i++) {
    cell_res += a[0][i] * b_t[0][i];
}
clock_gettime(CLOCK_REALTIME, &end);
with and without #pragma omp simd aligned(a, b_t:32) reduction(+:cell_res), and with and without the packing into the one-row matrices (adjusting the names in the aligned clause accordingly), but test all those versions separately. Moreover, test them for different input sizes.
Second Problem:
Another problematic thing is that:
for (int i = 0; i < N; i++) {
    cell_res += a[0][i] * b_t[0][i];
}
is memory-bound, so there is less opportunity for gains with SIMD; one should not expect much gain from SIMD in a double-precision floating-point dot product. A workaround is to change the matrices from double to float, consequently halving the necessary memory bandwidth and doubling the number of SIMD operations you can perform per register. Nonetheless, the aforementioned code snippet will still be memory-bound. That said, you might achieve some gains, mainly when the values are in cache.
On my machine, changing from double to float made the SIMD version noticeably faster than the version without it, even without using the packing. This might also be the issue that you have.
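For reference, a single-precision variant of the dot product could look like this (my adaptation, not code from the question); note that accumulating in float also changes the numerical result, which may or may not be acceptable for your use case:
// same pattern as the original loop, but a 32-byte (AVX-sized) register
// now holds 8 floats instead of 4 doubles
float *a2 = (float *)aligned_alloc(32, sizeof(float) * N);
float *b2 = (float *)aligned_alloc(32, sizeof(float) * N);
for (int i = 0; i < N; i++) {
    a2[i] = 1.0f;
    b2[i] = 1.0f;
}
float cell_res = 0.0f;
#pragma omp simd aligned(a2, b2:32) reduction(+:cell_res)
for (int i = 0; i < N; i++) {
    cell_res += a2[i] * b2[i];
}
printf("%f\n", cell_res); // consume the result so the loop is not optimized away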