I have the following C++ code to perform the multiply and accumulate steps of a fully connected layer (without the bias). Basically I just do a dot product using a vector (inputs) and a matrix (weights). I used AVX vectors to speed up the operation.
const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();
SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();
const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);
for(size_t n = 0; n < out_neurons; n++){
    float accum = 0.0;
    float temp[4] = {0,0,0,0};
    float *p = temp;
    __m128 in, ws, dp;
    for(size_t i = 0; i < in_neurons; i+=4){
        // gather the four strided weights for output n into a contiguous buffer
        temp[0] = scl[(i+0)*out_neurons + n];
        temp[1] = scl[(i+1)*out_neurons + n];
        temp[2] = scl[(i+2)*out_neurons + n];
        temp[3] = scl[(i+3)*out_neurons + n];
        // load input neurons sequentially
        in = _mm_load_ps(&src[i]);
        // load weights
        ws = _mm_load_ps(p);
        // dot product
        dp = _mm_dp_ps(in, ws, 0xff);
        // accumulator
        accum += dp.m128_f32[0];
    }
    // save the final result (accum is a plain float)
    dst[n] = accum;
}
It works, but the speedup is far from what I expected. As the numbers below show, a convolutional layer with roughly 24x more operations than my custom dot product layer takes less time. This makes no sense, so there should be much more room for improvement. What are my major mistakes when trying to use AVX? (I'm new to AVX programming, so I don't know where to start looking to optimize the code.)
**Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32
Input: 28x28x32 = 25K
Weights: (3*3*32)*32 = 9K
Number of MACs: 3*3*27*27*32*32 = 7M
Execution Time on OpenVINO framework: 0.049 ms
**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576
Weights: 3*3*64*512 = 295K
Number of MACs: 295K
Execution Time on OpenVINO framework: 0.197 ms
Thanks for all help in advance!
Addendum: What you are doing is actually a matrix-vector product. How to implement this efficiently is well understood (although, with caching and instruction-level parallelism, it is not completely trivial). The rest of this answer just shows a very simple vectorized implementation.
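For reference, here is a minimal scalar sketch of that matrix-vector product. It assumes the layout used in the question, where the weight for input `i` and output `n` sits at `scl[i * out_neurons + n]`; the function name `matvec_reference` is hypothetical. It is meant as a baseline for checking vectorized versions, not as a tuned implementation.

```cpp
#include <cstddef>

// Hypothetical helper: scalar matrix-vector product, dst = src * W.
// Assumed layout (from the question): W[i][n] is stored at scl[i * out_neurons + n].
void matvec_reference(const float* src, const float* scl, float* dst,
                      std::size_t in_neurons, std::size_t out_neurons)
{
    for (std::size_t n = 0; n < out_neurons; n++) {
        dst[n] = 0.0f;
    }
    // Outer loop over inputs, so the inner loop walks one weight row
    // contiguously in memory instead of jumping by out_neurons floats.
    for (std::size_t i = 0; i < in_neurons; i++) {
        const float x = src[i];
        const float* row = scl + i * out_neurons;
        for (std::size_t n = 0; n < out_neurons; n++) {
            dst[n] += row[n] * x;
        }
    }
}
```

The loop order is the point here: the question's code reads `scl` with a stride of `out_neurons` floats, while this order reads each row sequentially.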
You can drastically simplify your implementation by incrementing `n += 8` and `i += 1` (assuming `out_neurons` is a multiple of 8; otherwise, the last elements need some special handling), i.e., you can accumulate 8 `dst` values at once. Each load then reads 8 contiguous weights from row `i` instead of gathering strided elements one by one.
A very simple implementation assuming FMA is available (otherwise use multiplication and addition):
void dot_product(const float* src, const float* scl, float* dst,
                 const int in_neurons, const int out_neurons)
{
    for(size_t n = 0; n < out_neurons; n+=8){
        __m256 accum = _mm256_setzero_ps();
        for(size_t i = 0; i < in_neurons; i++){
            accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
        }
        // save the result
        _mm256_storeu_ps(dst+n ,accum);
    }
}
This could still be optimized, e.g., by accumulating 2, 4, or 8 `dst` packets inside the inner loop, which would not only save some broadcast operations (the `_mm256_set1_ps` instruction) but also hide the latency of the FMA instruction.
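As a rough sketch of that idea, the variant below keeps four independent accumulators, so each broadcast of `src[i]` feeds four FMAs covering 32 outputs. The name `dot_product_unrolled4` and the `FC_TARGET_FMA` macro are hypothetical; the macro is only an assumption that lets GCC or Clang compile the function without `-mavx2 -mfma` on the command line. The sketch assumes `out_neurons` is a multiple of 32 (576 is) and a CPU with FMA support; the best unroll factor should be measured rather than taken from here.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>

// Assumption: enable AVX2/FMA per function on GCC/Clang; other compilers
// need the matching command-line switch instead.
#if defined(__GNUC__) || defined(__clang__)
#define FC_TARGET_FMA __attribute__((target("avx2,fma")))
#else
#define FC_TARGET_FMA
#endif

// Hypothetical variant of dot_product with 4 accumulators (32 outputs per pass).
// Weight layout as in the question: scl[i * out_neurons + n].
FC_TARGET_FMA
void dot_product_unrolled4(const float* src, const float* scl, float* dst,
                           std::size_t in_neurons, std::size_t out_neurons)
{
    assert(out_neurons % 32 == 0);  // assumption: no tail handling here
    for (std::size_t n = 0; n < out_neurons; n += 32) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps();
        __m256 acc3 = _mm256_setzero_ps();
        for (std::size_t i = 0; i < in_neurons; i++) {
            // One broadcast is shared by four independent FMAs.
            const __m256 x = _mm256_set1_ps(src[i]);
            const float* w = scl + i * out_neurons + n;
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(w),      x, acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(w + 8),  x, acc1);
            acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(w + 16), x, acc2);
            acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(w + 24), x, acc3);
        }
        _mm256_storeu_ps(dst + n,      acc0);
        _mm256_storeu_ps(dst + n + 8,  acc1);
        _mm256_storeu_ps(dst + n + 16, acc2);
        _mm256_storeu_ps(dst + n + 24, acc3);
    }
}
```

Because the four accumulators do not depend on each other, the CPU can have several FMAs in flight at once instead of waiting for each result before starting the next.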
Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi
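The special handling mentioned above for an `out_neurons` that is not a multiple of 8 could look like the sketch below: full packets of 8 go through the vector loop and the leftover columns fall back to scalar code. The name `dot_product_any_width` and the `FC_TARGET_FMA` macro are hypothetical (the macro assumes GCC or Clang), and a masked load/store would be another option; this is just the simplest one.

```cpp
#include <immintrin.h>
#include <cstddef>

// Assumption: enable AVX2/FMA per function on GCC/Clang; other compilers
// need the matching command-line switch instead.
#if defined(__GNUC__) || defined(__clang__)
#define FC_TARGET_FMA __attribute__((target("avx2,fma")))
#else
#define FC_TARGET_FMA
#endif

// Hypothetical variant of dot_product that accepts any out_neurons.
// Weight layout as in the question: scl[i * out_neurons + n].
FC_TARGET_FMA
void dot_product_any_width(const float* src, const float* scl, float* dst,
                           std::size_t in_neurons, std::size_t out_neurons)
{
    // Largest multiple of 8 that fits into out_neurons.
    const std::size_t vec_end = out_neurons - (out_neurons % 8);
    std::size_t n = 0;
    // Vector part: 8 outputs per iteration.
    for (; n < vec_end; n += 8) {
        __m256 accum = _mm256_setzero_ps();
        for (std::size_t i = 0; i < in_neurons; i++) {
            accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i * out_neurons + n]),
                                    _mm256_set1_ps(src[i]), accum);
        }
        _mm256_storeu_ps(dst + n, accum);
    }
    // Scalar tail: at most 7 remaining outputs.
    for (; n < out_neurons; n++) {
        float accum = 0.0f;
        for (std::size_t i = 0; i < in_neurons; i++) {
            accum += scl[i * out_neurons + n] * src[i];
        }
        dst[n] = accum;
    }
}
```

The scalar tail never touches memory past the end of a weight row or of `dst`, which is what an unguarded 8-wide load or store on the last columns would risk.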