In the RISC-V edition of 'Computer Organization and Design' by Patterson and Hennessy, 'FIGURE 3.21' comments that _mm256_load_pd(C + i + j * n) loads C[i][j], which looked odd to me at first glance. (The code is similar to the Berkeley dgemm_unroll code, which comes from an Intel article.)
Code in the book:
void dgemm_avx256(const uint32_t n, const double* A, const double* B, double* C)
{
    for( uint32_t i = 0; i < n; i += 4 )
    {
        for( uint32_t j = 0; j < n; j++ )
        {
            __m256d c0 = _mm256_load_pd(C + i + j * n); /* c0 = C[i][j] */
            for( uint32_t k = 0; k < n; k++ )
            {
                c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
                        _mm256_mul_pd( _mm256_load_pd(A + i + k * n), _mm256_broadcast_sd(B + k + j * n) ) );
            }
            _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
        }
    }
}
Then I read the Intel reference for _mm256_load_pd online. It says the parameter is a '256-bit aligned memory location', and the load reads four consecutive doubles starting at that address.
Q: In ordinary C (row-major) terms, C + i + j * n should therefore be C[j][i] instead of C[i][j]. Did I get something wrong?
I also checked with gdb: the second time _mm256_load_pd runs, it loads from C[1][0].
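To make the two readings of C + i + j * n concrete, here is a minimal sketch in plain C. The helper names idx_row_major and idx_col_major are my own (hypothetical, not from the book or from Intel); they only spell out the offset arithmetic for an n x n matrix stored in one flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (not from the book): flat offset of element (i, j)
   in an n x n matrix stored in a single array. */

/* Row-major (the usual C convention): row i is contiguous. */
static size_t idx_row_major(uint32_t i, uint32_t j, uint32_t n)
{
    assert(i < n && j < n);
    return (size_t)i * n + j;
}

/* Column-major (the Fortran convention): column j is contiguous. */
static size_t idx_col_major(uint32_t i, uint32_t j, uint32_t n)
{
    assert(i < n && j < n);
    return (size_t)i + (size_t)j * n;
}
```

For example, with n = 4, i = 0 and j = 1, the offset i + j * n is 4. That equals idx_col_major(0, 1, 4) and also idx_row_major(1, 0, 4), so the same address reads as C[0][1] under column-major indexing and as C[1][0] under row-major indexing.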
Below is an assembly snippet (built temporarily with -nopie), with comments from -fverbose-asm plus some that I added myself:
0x00000000004035b0 <+48>: 44 89 fe mov esi,r15d ; _60, i
0x00000000004035b3 <+51>: 45 31 d2 xor r10d,r10d ;ivtmp.20 is `j*n`
...
0x00000000004035c5 <+69>: 48 8d 04 32 lea rax,[rdx+rsi*1] ;i+j*n
0x00000000004035c9 <+73>: 4d 8d 5c c5 00 lea r11,[r13+rax*8+0x0] ; *8 bytes
0x00000000004035ce <+78>: 49 8d 04 d4 lea rax,[r12+rdx*8]
0x00000000004035d2 <+82>: 4c 01 f2 add rdx,r14
=> 0x00000000004035d5 <+85>: c4 c1 7d 28 03 vmovapd ymm0,YMMWORD PTR [r11] ; r11 is C + i + j * n
...
0x0000000000403602 <+130>: 41 01 fa add r10d,edi ; ivtmp.20 += n (j*n advances by n)
Thanks for the comments above.
The code uses Fortran-style (column-major) indexing, even though the COD book describes it as 'Optimized C version of DGEMM using C intrinsics to generate the AVX subword-parallel instructions for the x86.'
Marking the Q&A as solved.
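As a cross-check of the column-major reading, here is a scalar sketch that uses the same offsets as the AVX code (i + j * n for C, i + k * n for A, k + j * n for B). The function name dgemm_scalar_colmajor is my own, and this is an illustration under that indexing assumption, not a quotation of the book's listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch with the same column-major indexing as the AVX code:
   element (i, j) of an n x n matrix lives at offset i + j * n. */
static void dgemm_scalar_colmajor(uint32_t n, const double *A,
                                  const double *B, double *C)
{
    for (uint32_t i = 0; i < n; i++) {
        for (uint32_t j = 0; j < n; j++) {
            double cij = C[i + (size_t)j * n];      /* cij = C[i][j] */
            for (uint32_t k = 0; k < n; k++) {
                /* cij += A[i][k] * B[k][j] */
                cij += A[i + (size_t)k * n] * B[k + (size_t)j * n];
            }
            C[i + (size_t)j * n] = cij;             /* C[i][j] = cij */
        }
    }
}
```

With this layout, consecutive values of i are adjacent in memory, which is why the AVX version can step i by 4 and load four elements of one column of C or A with a single _mm256_load_pd.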