Fully Connected Layer (dot product) using AVX

I have the following C++ code that performs the multiply-accumulate step of a fully connected layer (without the bias). Essentially, it is a dot product between a vector (the inputs) and each column of a matrix (the weights). I used SIMD vectors (128-bit `__m128`) to speed up the operation.

const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();

SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();

const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);    

for(size_t n = 0; n < out_neurons; n++){
    float accum = 0.0;
    alignas(16) float temp[4] = {0,0,0,0};  // _mm_load_ps(p) below requires 16-byte alignment
    float *p = temp;

    __m128 in, ws, dp;

    for(size_t i = 0; i < in_neurons; i+=4){

        // gather four weights of output neuron n (stride out_neurons) into temp
        temp[0] = scl[(i+0)*out_neurons + n];
        temp[1] = scl[(i+1)*out_neurons + n];
        temp[2] = scl[(i+2)*out_neurons + n];
        temp[3] = scl[(i+3)*out_neurons + n];

        // load input neurons sequentially
        in = _mm_load_ps(&src[i]);

        // load weights
        ws = _mm_load_ps(p);

        // dot product
        dp = _mm_dp_ps(in, ws, 0xff);

        // accumulator
        accum += _mm_cvtss_f32(dp);  // portable way to read lane 0
    }
    // save the final result
    dst[n] = accum;
}

It works, but the speedup is far from what I expected. As you can see below, a convolutional layer with about 24x more operations than my custom dot product layer takes less time. This makes no sense, so there should be much more room for improvement. What are my major mistakes in trying to use AVX? (I'm new to AVX programming, so I don't know where to start looking in order to optimize the code.)

**Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32 
Input: 28x28x32 = 25K   
Weights: (3*3*32)*32 = 9K   
Number of MACs: 3*3*27*27*32*32 = 7M    
Execution Time on OpenVINO framework: 0.049 ms

**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576    
Weights: 3*3*64*512 = 295K  
Number of MACs: 295K    
Execution Time on OpenVINO framework: 0.197 ms

Thanks for all help in advance!

Solution

Addendum: What you are doing is actually a matrix-vector product. How to implement this efficiently is well understood (although, with caching and instruction-level parallelism, it is not completely trivial). The rest of this answer just shows a very simple vectorized implementation.

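You can drastically simplify your implementation by incrementing n+=8 and i+=1, i.e., by accumulating 8 dst values at once. For a fixed i, the weights of 8 consecutive outputs (scl[i*out_neurons+n] to scl[i*out_neurons+n+7]) are contiguous in memory, so a single vector load replaces the four strided scalar reads and the horizontal _mm_dp_ps. This assumes out_neurons is a multiple of 8; otherwise, the last elements need some special handling.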

A very simple implementation assuming FMA is available (otherwise use multiplication and addition):

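#include <immintrin.h>  // AVX/FMA intrinsics (build with e.g. -mavx2 -mfma)
#include <cstddef>      // size_t

void dot_product(const float* src, const float* scl, float* dst,
                 const int in_neurons, const int out_neurons)
{
    // each outer iteration computes 8 consecutive outputs dst[n..n+7]
    for(size_t n = 0; n < out_neurons; n+=8){
        __m256 accum = _mm256_setzero_ps();

        for(size_t i = 0; i < in_neurons; i++){
            // one unaligned load fetches 8 contiguous weights of row i;
            // src[i] is broadcast to all 8 lanes and multiply-added into accum
            accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
        }
        // save the result
        _mm256_storeu_ps(dst+n ,accum);
    }
}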

This could still be optimized, e.g., by accumulating 2, 4, or 8 dst packets inside the inner loop. That would not only save some broadcast operations (the _mm256_set1_ps intrinsic), but also hide the latency of the FMA instruction, because several independent accumulators can be in flight at the same time.

Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi