AVX load instruction with increment

Is there an AVX instruction that can load four double values from a regular, aligned array with an increment (stride)? For example, I want a call like _mm256_load_pd(a), but with an increment of 4, so that it loads a[0], a[4], a[8] and a[12] instead of a[0], a[1], a[2] and a[3].

Solution

If you have AVX2 (Haswell and later) then you can use gathered loads, e.g. _mm256_i32gather_pd. From the Intel Intrinsics Guide:

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

#include "immintrin.h"

Instruction: vgatherdpd ymm, vm32x, ymm

CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
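For the access pattern in the question, a minimal sketch might look like the following. The helper name gather_stride4_pd and the output buffer are assumptions made for illustration, and the target attribute is a GCC/Clang extension that enables AVX2 for this one function (drop it if you compile with -mavx2 or use another compiler). Because the indices count elements rather than bytes, scale is set to 8, the size of a double.

```c
#include <immintrin.h>

/* Hypothetical helper: gathers a[0], a[4], a[8], a[12] and stores them
   to out[0..3]. Requires a CPU with AVX2. */
__attribute__((target("avx2")))
static inline void gather_stride4_pd(const double *a, double *out)
{
    /* _mm_set_epi32 takes its arguments from the highest lane down,
       so lane 0 holds index 0 and lane 3 holds index 12. */
    const __m128i vindex = _mm_set_epi32(12, 8, 4, 0);

    /* scale = 8 bytes per index step, i.e. sizeof(double). */
    const __m256d v = _mm256_i32gather_pd(a, vindex, 8);

    /* Unaligned store so out needs no particular alignment. */
    _mm256_storeu_pd(out, v);
}
```

In real code you would normally keep the __m256d in a register and use it directly instead of storing it; the store is only there to make the result easy to inspect.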

As already noted in the comments, gathered loads are slow on Haswell, but they may still be worthwhile if you need this access pattern to feed subsequent 256-bit SIMD operations. Since you're using doubles, though, each 256-bit vector holds only four elements, so any benefit may be small, and you might also want to benchmark against a conventional scalar implementation.
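One possible baseline for such a benchmark is to build the same vector from four ordinary scalar loads, which needs only AVX rather than AVX2. This is a sketch under the same assumptions as before: the helper name set_stride4_pd is made up, and the target attribute is a GCC/Clang extension. Whether it beats the gather depends on the CPU and the surrounding code, so measure both.

```c
#include <immintrin.h>

/* Hypothetical helper: builds the vector {a[0], a[4], a[8], a[12]}
   from four scalar loads and stores it to out[0..3]. AVX only. */
__attribute__((target("avx")))
static inline void set_stride4_pd(const double *a, double *out)
{
    /* _mm256_set_pd lists elements from the highest lane down,
       so a[0] ends up in lane 0 and a[12] in lane 3. */
    const __m256d v = _mm256_set_pd(a[12], a[8], a[4], a[0]);

    /* Unaligned store so out needs no particular alignment. */
    _mm256_storeu_pd(out, v);
}
```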