Tags: c, assembly, x86-64, column-major-order

Question about the DGEMM example: which element does `_mm256_load_pd(C + i + j * n)` load?


In the RISC-V edition of 'Computer Organization and Design' by Patterson and Hennessy, 'FIGURE 3.21' annotates _mm256_load_pd(C + i + j * n) as C[i][j], which looked odd to me at first glance. (The code is similar to the Berkeley dgemm_unroll code, which comes from an Intel article.)

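Code in the book (the two #include lines are added here so that it compiles):

#include <stdint.h>
#include <immintrin.h>

void dgemm_avx256(const uint32_t n, const double* A, const double* B, double* C)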
{

    for( uint32_t i = 0; i < n; i += 4 )
    {
        for( uint32_t j = 0; j < n; j++ ) 
        {
            __m256d c0 = _mm256_load_pd(C + i + j * n); /* c0 = C[i][j] */
            for( uint32_t k = 0; k < n; k++ )
            {
                c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
                        _mm256_mul_pd( _mm256_load_pd(A + i + k * n), _mm256_broadcast_sd(B + k + j * n) ) );
            }

            _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
        }
    }
}

I then read Intel's reference for _mm256_load_pd online. It describes the parameter as a '256-bit aligned memory location', and the intrinsic loads four consecutive doubles starting at that address.
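
The alignment requirement also constrains i and n. As a minimal sketch, assuming the base pointer C is itself 32-byte aligned, the check below tests whether the address C + i + j * n stays 32-byte aligned. The helper name is hypothetical and not from the book.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if (base + i + j * n) is 32-byte aligned,
 * ASSUMING base itself is 32-byte aligned and elements are doubles. */
static int offset_is_32byte_aligned(size_t i, size_t j, size_t n)
{
    assert(sizeof(double) == 8);           /* assumption of this sketch */
    size_t byte_offset = (i + j * n) * sizeof(double);
    return byte_offset % 32 == 0;
}
```

With i stepping by 4 (as in the outer loop) and n a multiple of 4, every offset is a multiple of 32 bytes; with n = 6, for example, the column j = 1 would start at a misaligned address.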


Q: With C's row-major array layout, C + i + j * n should address C[j][i], not C[i][j]. Am I misunderstanding something?
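
To make the two readings of the same flat offset concrete, here is a small sketch. The function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major (C-style) view: flat offset of element [row][col]
 * in an n x n array. */
static size_t row_major_offset(size_t row, size_t col, size_t n)
{
    assert(row < n && col < n);
    return row * n + col;
}

/* Column-major (Fortran-style) view: flat offset of element (row, col)
 * in an n x n matrix. */
static size_t col_major_offset(size_t row, size_t col, size_t n)
{
    assert(row < n && col < n);
    return row + col * n;
}
```

For i = 0, j = 1, n = 4 the expression i + j * n is 4. Read as row-major, that offset is element [1][0], that is, [j][i]. Read as column-major, the same offset is element (0, 1), that is, (i, j).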


I also checked with gdb: the second time _mm256_load_pd runs, it loads from the address of C[1][0].

Below is a snippet of the assembly (built with -nopie for now), with comments generated by -fverbose-asm plus some that I added myself:

   0x00000000004035b0 <+48>:    44 89 fe                mov    esi,r15d ; _60, i
   0x00000000004035b3 <+51>:    45 31 d2                xor    r10d,r10d ;ivtmp.20 is `j*n`
...
   0x00000000004035c5 <+69>:    48 8d 04 32             lea    rax,[rdx+rsi*1] ;i+j*n
   0x00000000004035c9 <+73>:    4d 8d 5c c5 00          lea    r11,[r13+rax*8+0x0] ; *8 bytes
   0x00000000004035ce <+78>:    49 8d 04 d4             lea    rax,[r12+rdx*8]
   0x00000000004035d2 <+82>:    4c 01 f2                add    rdx,r14
=> 0x00000000004035d5 <+85>:    c4 c1 7d 28 03          vmovapd ymm0,YMMWORD PTR [r11] ; r11 is C + i + j * n
...
   0x0000000000403602 <+130>:   41 01 fa                add    r10d,edi ; ivtmp.20 (j*n) += n

Solution

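  • Thanks for the comments above.

    The code uses Fortran-style (column-major) indexing: element (i, j) is stored at C[i + j * n], so the 'C[i][j]' comments in the book denote the matrix element rather than a C two-dimensional array subscript. This holds even though the COD book describes the code as 'Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86.'

    Marking the Q&A as solved.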
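
For comparison, a scalar sketch of the same computation under the column-major convention is shown below. The IDX macro and the function name are assumptions introduced for illustration and do not come from the book.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: flat offset of element (i, j) in an n x n
 * column-major matrix. */
#define IDX(i, j, n) ((size_t)(i) + (size_t)(j) * (size_t)(n))

/* C += A * B, with all three matrices stored column-major. */
void dgemm_scalar_colmajor(uint32_t n, const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (uint32_t j = 0; j < n; j++) {
        for (uint32_t i = 0; i < n; i++) {
            double cij = C[IDX(i, j, n)];              /* cij = C(i, j) */
            for (uint32_t k = 0; k < n; k++) {
                /* cij += A(i, k) * B(k, j) */
                cij += A[IDX(i, k, n)] * B[IDX(k, j, n)];
            }
            C[IDX(i, j, n)] = cij;                     /* C(i, j) = cij */
        }
    }
}
```

The AVX version handles four consecutive values of i at once, which works because consecutive rows of one column are adjacent in memory under this layout.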