I am currently reading an article on GitHub about performance optimisation using Clang's extended vector syntax. The author gives the following code snippet:
The templated code below implements the innermost loops that calculate a patch of size regA x regB in matrix C. The code loads regA scalars from matrixA and regB SIMD-width vectors from matrix B. The program uses Clang's extended vector syntax.
/// Compute a RAxRB block of C using a vectorized dot product, where RA is the
/// number of registers to load from matrix A, and RB is the number of registers
/// to load from matrix B.
template <unsigned regsA, unsigned regsB>
void matmul_dot_inner(int k, const float *a, int lda, const float *b, int ldb,
                      float *c, int ldc) {
  float8 csum[regsA][regsB] = {{0.0}};
  for (int p = 0; p < k; p++) {
    // Perform the DOT product.
    for (int bi = 0; bi < regsB; bi++) {
      float8 bb = LoadFloat8(&B(p, bi * 8));
      for (int ai = 0; ai < regsA; ai++) {
        float8 aa = BroadcastFloat8(A(ai, p));
        csum[ai][bi] += aa * bb;
      }
    }
  }
  // Accumulate the results into C.
  for (int ai = 0; ai < regsA; ai++) {
    for (int bi = 0; bi < regsB; bi++) {
      AdduFloat8(&C(ai, bi * 8), csum[ai][bi]);
    }
  }
}
The part of the code shown below confuses me the most. I read the full article and understood the logic behind blocking and computing a small patch, but I can't entirely understand what this bit means:
// Perform the DOT product.
for (int bi = 0; bi < regsB; bi++) {
  float8 bb = LoadFloat8(&B(p, bi * 8)); // the pointer to the range of values?
  for (int ai = 0; ai < regsA; ai++) {
    float8 aa = BroadcastFloat8(A(ai, p));
    csum[ai][bi] += aa * bb;
  }
}
Can anyone elaborate on what's going on here? The article can be found here.
The 2nd comment on the article links to https://github.com/pytorch/glow/blob/405e632ef138f1d49db9c3181182f7efd837bccc/lib/Backends/CPU/libjit/libjit_defs.h#L26 which defines the float8 type as
typedef float float8 __attribute__((ext_vector_type(8)));
(similar to how immintrin.h defines __m256), and defines the load / broadcast functions similarly to _mm256_load_ps and _mm256_set1_ps. With that header, you should be able to compile the code in the article.
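To make the loop in the question concrete, here is a minimal sketch of what one pass does for a single row of A and a single vector of B (i.e. regsA = regsB = 1). The helper bodies below are my own stand-ins, not the definitions from the glow header, and `block_1x8` is a hypothetical name. I'm also assuming row-major storage, so that `&B(p, bi * 8)` points at 8 adjacent floats in row p; on GCC the sketch falls back to the similar `vector_size` attribute so it builds there too.

```cpp
#include <cassert>
#include <cstring>

// Assumption: an 8-lane float vector like the float8 in the linked header.
#if defined(__clang__)
typedef float float8 __attribute__((ext_vector_type(8)));
#else
typedef float float8 __attribute__((vector_size(32)));
#endif

// Hypothetical stand-ins for the helpers used in the article's snippet.
static inline float8 LoadFloat8(const float *p) {
  float8 v;
  std::memcpy(&v, p, sizeof(v)); // load 8 adjacent floats, any alignment
  return v;
}
static inline float8 BroadcastFloat8(float x) {
  float8 v = {x, x, x, x, x, x, x, x}; // same scalar in every lane
  return v;
}
static inline void AdduFloat8(float *p, float8 v) {
  float8 old = LoadFloat8(p);
  old += v; // element-wise add
  std::memcpy(p, &old, sizeof(old));
}

// c[0..7] += sum over p of a[p] * b[p * ldb + 0..7]
// (a is one row of A, b points at an 8-wide column strip of row-major B).
static inline void block_1x8(int k, const float *a, const float *b, int ldb,
                             float *c) {
  assert(k >= 0 && ldb >= 8);
  float8 csum = BroadcastFloat8(0.0f);
  for (int p = 0; p < k; p++) {
    float8 bb = LoadFloat8(&b[p * ldb]); // B(p, 0..7)
    float8 aa = BroadcastFloat8(a[p]);   // A(0, p) in all 8 lanes
    csum += aa * bb;                     // lane j gains A(0,p) * B(p,j)
  }
  AdduFloat8(c, csum);                   // C(0, 0..7) += csum
}
```

So `bb` is not a pointer: `LoadFloat8` takes a pointer and returns the 8 float values stored there. Each lane of `csum` ends up holding the dot product of one row of A with one column of B, and the templated version just keeps regsA x regsB such accumulators in registers at once.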
See Clang's native vector documentation. GNU C native vector syntax is a nice way to get an overloaded * operator. I don't know what clang's ext_vector_type does that GCC/clang/ICC float __attribute__((vector_size(32))) (32 byte width) wouldn't.
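As a small illustration of the overloaded operators, here is a sketch using the `vector_size` form, which both GCC and Clang accept. The names `v8f`, `mul_add`, `lane` and `hsum` are made up for this example. For what it's worth, Clang's documentation does list OpenCL-style component access (e.g. `v.xy`) as something `ext_vector_type` supports and `vector_size` vectors do not, which this sketch doesn't need.

```cpp
#include <cassert>

// GNU C native vector: 32 bytes = 8 floats.
typedef float v8f __attribute__((vector_size(32)));

// Element-wise multiply-add written with ordinary operators:
// the compiler picks the SIMD instructions for the target.
static inline v8f mul_add(v8f a, v8f b, v8f acc) {
  return acc + a * b;
}

// Read one lane with plain subscripting.
static inline float lane(v8f v, int i) {
  assert(i >= 0 && i < 8);
  return v[i];
}

// Horizontal sum of all 8 lanes, done lane by lane for clarity.
static inline float hsum(v8f v) {
  float s = 0.0f;
  for (int i = 0; i < 8; i++) s += lane(v, i);
  return s;
}
```

If AVX isn't enabled at compile time, the compiler still accepts this and lowers the 32-byte operations to whatever narrower instructions are available.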
The article could have added one small section to explain the vector types, but it seems to be focused on the performance details rather than on explaining how to use the syntax.
Most of the discussion in the article is about how to manually vectorize matmul for cache efficiency with SIMD vectors. That part looks good from the quick skim I gave it.
You can do those things with any of several ways to manually vectorize: GNU C native vectors, clang's very similar "extended" vectors, or portable Intel intrinsics.
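For comparison, the same broadcast-multiply-accumulate step could look roughly like this with Intel intrinsics. This is only a sketch: `axpy8` is a hypothetical name, I've used the unaligned load/store intrinsics so it works on any pointer, and the scalar fallback is there just so it still builds when AVX isn't enabled.

```cpp
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// c[0..7] += x * b[0..7]
static inline void axpy8(float x, const float *b, float *c) {
  assert(b != nullptr && c != nullptr);
#if defined(__AVX__)
  __m256 bb = _mm256_loadu_ps(b);  // 8 adjacent floats from b
  __m256 aa = _mm256_set1_ps(x);   // broadcast x to all 8 lanes
  __m256 cc = _mm256_loadu_ps(c);  // current contents of c
  cc = _mm256_add_ps(cc, _mm256_mul_ps(aa, bb));
  _mm256_storeu_ps(c, cc);         // write the 8 sums back
#else
  // Scalar fallback for builds without AVX.
  for (int j = 0; j < 8; j++) c[j] += x * b[j];
#endif
}
```

The native-vector version is shorter mainly because `+` and `*` replace the `_mm256_add_ps` / `_mm256_mul_ps` calls.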