I have found following SSE2 code written to multiply 2x2 matrix. Can anybody explain me how this code is executing. When I go through the code I feel it just add values into two positions of C(2x2) matrix (C[0],C[3]
). lda
is the size of the large matrix and A,B and C are 2x2 matrix.
static void simd_2x2(int lda, double* A, double* B, double* C)
{
__m128d a,b1,b2,c1,c2;
c1 = _mm_loadu_pd( C+0*lda ); //load unaligned block in C
c2 = _mm_loadu_pd( C+1*lda );
for( int i = 0; i < 2; ++i )
{
a = _mm_load_pd( A+i*lda );//load aligned i-th column of A
b1 = _mm_load1_pd( B+i+0*lda ); //load i-th row of B
b2 = _mm_load1_pd( B+i+1*lda );
c1=_mm_add_pd( c1, _mm_mul_pd( a, b2 ) ); //rank-1 update
c2=_mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
}
_mm_storeu_pd( C+0*lda, c1 ); //store unaligned block in C
_mm_storeu_pd( C+1*lda, c2 );
}
I'm guessing the source of your confusion is that the double precision intrinsics (_mm_load_pd
et al) each process a vector of two double precision values. lda
appears to be the stride. So for example:
c1 = _mm_loadu_pd( C+0*lda );
c2 = _mm_loadu_pd( C+1*lda );
loads a 2x2 block of doubles from C, C+1, C+lda, C+lda+1.