
Implementation of convolution using Rust with SIMD instructions


Question's Goal: I was wondering if anyone can help me improve my Rust code by helping me understand whether I am using the SIMD instructions correctly and/or whether there is a more effective way to write the SIMD for my purpose.

Function: The function's goal is to perform convolution using the straightforward approach (that is, not using the FFT). I am only using SIMD instructions to multiply the input with the kernel, and nowhere else. Looking at conv_kernel, the function takes two slices of f64 and should be multiplying them together.

Code:

use rayon::prelude::*;
use std::arch::x86_64::*;

const VEC_SIZE: usize = 4;

#[allow(dead_code)]
#[target_feature(enable = "avx2")]
pub unsafe fn conv_simd(input: &[f64], kernel: &[f64]) -> Vec<f64> {
    let input_len = input.len();
    let kernel_len = kernel.len();
    let output_len = input_len - kernel_len + 1;
    let output = (0..output_len).into_par_iter().map(|i|
        (0..kernel_len).step_by(4).map(|k| 
            conv_kernel(&input[i+k..i+k+4], &kernel[k..k+4])
        ).sum()
    ).collect();
    output
}

#[allow(dead_code)]
#[target_feature(enable = "avx2")]
#[inline]
pub unsafe fn conv_kernel(a: &[f64], b: &[f64]) -> f64 {
    let mut output: [f64; VEC_SIZE] = [0.0; VEC_SIZE];
    _mm256_storeu_pd(output.as_mut_ptr(), _mm256_mul_pd(_mm256_set_pd(a[0], a[1], a[2], a[3]), _mm256_set_pd(b[0], b[1], b[2], b[3])));
    output.iter().sum()
}

Additional Notes: This function is not important to any project and is more of an exercise for future work with SIMD if the need arises. Thanks for any responses. This is my first time using SIMD, and I am not sure whether the compiler is actually using the SIMD instructions, since I am using the normal Rust compiler and not nightly. The compiler compiled the code, so I assume it is using SIMD. The odd part is that this method ran slightly slower than the same method without SIMD instructions, so I am not sure what the next step is to improve this code.
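For context, the non-SIMD baseline being compared against is not shown in the question; a minimal sketch of such a scalar "valid" convolution (same indexing as conv_simd, no kernel reversal, and my own naming) might look like:

```rust
// Hypothetical scalar baseline: plain "valid" convolution, matching the
// input[i + k] * kernel[k] indexing used in the question's conv_simd.
pub fn conv_scalar(input: &[f64], kernel: &[f64]) -> Vec<f64> {
    let output_len = input.len() - kernel.len() + 1;
    (0..output_len)
        .map(|i| {
            kernel
                .iter()
                .enumerate()
                .map(|(k, &kv)| input[i + k] * kv)
                .sum()
        })
        .collect()
}

fn main() {
    let input = [1.0, 2.0, 3.0, 4.0, 5.0];
    let kernel = [1.0, 0.5];
    println!("{:?}", conv_scalar(&input, &kernel)); // [2.0, 3.5, 5.0, 6.5]
}
```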


Solution

  • The assumption that the compiler is using AVX (and, furthermore, using AVX effectively) is unreliable. It is always better to check the generated assembly.

    _mm256_set_pd isn't a single SIMD instruction; it's a sequence of scalar operations, implemented either as a series of data-shuffling operations or by going through memory and taking the store-forwarding stalls. You need to use _mm256_loadu_pd (slices are not guaranteed to be 32-byte aligned, so the unaligned load is the safe choice) or the Rust equivalent instead. That's the most probable reason why your code isn't faster.
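As a hedged sketch (not the only way to write it), conv_kernel rewritten with unaligned vector loads instead of _mm256_set_pd could look like this, assuming both slices have at least 4 elements:

```rust
use std::arch::x86_64::*;

/// Sketch: conv_kernel using _mm256_loadu_pd instead of _mm256_set_pd.
/// Assumes a.len() >= 4 and b.len() >= 4; slices are not guaranteed to be
/// 32-byte aligned, hence the unaligned load.
#[target_feature(enable = "avx2")]
pub unsafe fn conv_kernel(a: &[f64], b: &[f64]) -> f64 {
    let va = _mm256_loadu_pd(a.as_ptr()); // one vector load
    let vb = _mm256_loadu_pd(b.as_ptr()); // one vector load
    let prod = _mm256_mul_pd(va, vb);     // elementwise multiply
    let mut out = [0.0f64; 4];
    _mm256_storeu_pd(out.as_mut_ptr(), prod);
    out.iter().sum() // horizontal add of the 4 products
}

fn main() {
    if is_x86_feature_detected!("avx2") {
        let a = [1.0, 2.0, 3.0, 4.0];
        let b = [4.0, 3.0, 2.0, 1.0];
        // 1*4 + 2*3 + 3*2 + 4*1 = 20
        println!("{}", unsafe { conv_kernel(&a, &b) });
    }
}
```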

    Multiplication and addition don't have to be done separately. They can be done in a single instruction using the FMA intrinsic _mm256_fmadd_pd or its equivalent. There are many other such functions documented in std::arch::x86_64.
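A minimal sketch of one dot-product row using _mm256_fmadd_pd with a vector accumulator (my own helper name; assumes b.len() is a multiple of 4 and a.len() >= b.len(), matching the question's preconditions):

```rust
use std::arch::x86_64::*;

/// Sketch: dot product accumulated with FMA, one horizontal add at the end.
/// Assumes b.len() is a multiple of 4 and a.len() >= b.len().
#[target_feature(enable = "avx2,fma")]
pub unsafe fn dot_fma(a: &[f64], b: &[f64]) -> f64 {
    let mut acc = _mm256_setzero_pd(); // vector accumulator
    for k in (0..b.len()).step_by(4) {
        let va = _mm256_loadu_pd(a.as_ptr().add(k));
        let vb = _mm256_loadu_pd(b.as_ptr().add(k));
        acc = _mm256_fmadd_pd(va, vb, acc); // acc += va * vb in one instruction
    }
    // single horizontal reduction after the loop
    let mut out = [0.0f64; 4];
    _mm256_storeu_pd(out.as_mut_ptr(), acc);
    out.iter().sum()
}

fn main() {
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let b = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0];
        println!("{}", unsafe { dot_fma(&a, &b) }); // 120
    }
}
```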

    A basic implementation of this algorithm should look more like the following in terms of the methods used (but rewritten with functions and loops; it's just pseudo-code so far):

    #![feature(portable_simd)]
    use std::simd::f64x4;
    use std::simd::StdFloat;
    fn main() {
        //i.. loop here
        let mut acum = f64x4::splat(0.0); //use accumulator to avoid horizontal add
        //j.. loop here
        //simplified data; in a complete algorithm input should be indexed i+j..
        //and kernel should be indexed j..
        let input = [1.0, 2.0, 3.0, 4.0];
        let kernel = [4.0, 3.0, 2.0, 1.0];

        let x = f64x4::from_slice(&input[0..4]); //load
        let y = f64x4::from_slice(&kernel[0..4]); //load
        acum = x.mul_add(y, acum); //fused multiply-add
        //end of j.. loop
        let arr = acum.to_array(); //store
        let result: f64 = arr.iter().sum(); //save result to output[i]
        println!("{result}");
    }