How the following following SSE2 code read data

I have found following SSE2 code written to multiply 2x2 matrix. Can anybody explain me how this code is executing. When I go through the code I feel it just add values into two positions of C(2x2) matrix (C[0],C[3]). lda is the size of the large matrix and A,B and C are 2x2 matrix.

static void simd_2x2(int lda, double* A, double* B, double* C)
{ 
    __m128d a,b1,b2,c1,c2;

    c1 = _mm_loadu_pd( C+0*lda );   //load unaligned block in C
    c2 = _mm_loadu_pd( C+1*lda );
    for( int i = 0; i < 2; ++i )
    {
        a = _mm_load_pd( A+i*lda );//load aligned i-th column of A
        b1 = _mm_load1_pd( B+i+0*lda ); //load i-th row of B
        b2 = _mm_load1_pd( B+i+1*lda );
        c1=_mm_add_pd( c1, _mm_mul_pd( a, b2 ) ); //rank-1 update
        c2=_mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
    }
    _mm_storeu_pd( C+0*lda, c1 );   //store unaligned block in C
    _mm_storeu_pd( C+1*lda, c2 );
}

Solution

I'm guessing the source of your confusion is that the double precision intrinsics (_mm_load_pd et al) each process a vector of two double precision values. lda appears to be the stride. So for example:

        c1 = _mm_loadu_pd( C+0*lda );
        c2 = _mm_loadu_pd( C+1*lda );

loads a 2x2 block of doubles from C, C+1, C+lda, C+lda+1.