Prefetching pd (4 doubles) into a __m256d register

I want to prefetch some data using AVX. I was checking the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/), but it lists only _mm_prefetch(...), under SSE. Does anyone know a workaround for AVX?

Update 19.02.15:
Maybe I am misunderstanding the purpose of prefetching, so I want to describe the problem in a bit more detail:

#include <x86intrin.h>
...
__m128 x0 = ...;
...
// doing some vector operations ...
for (int i=0; i<ndiv4; ++i) {
    _mm_prefetch((char*)(y+4*i+8), _MM_HINT_NTA); // prefetch data for two iterations later
    __m128 x1 = _mm_load_ps(x+4*i); // aligned load
    __m128 x2 = _mm_mul_ps(x0,x1); // x0 defined earlier
    _mm_store_ps(x+4*i,x2); // store aligned back
}

(I know that the prefetch might not necessarily help in this case.)
My question is whether, or how, I could do it using __m256d registers and pd respectively?

Solution

I think the literal answer to "how I could do it using __m256d registers and pd respectively?" would be this:

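for (int i=0; i<ndiv4; ++i) {
    _mm_prefetch((char*)(y+4*i+8), _MM_HINT_NTA); // prefetch data for two iterations later
    __m256d x1 = _mm256_load_pd(x+4*i); // aligned load (needs 32-byte alignment)
    __m256d x2 = _mm256_mul_pd(x0,x1); // x0 is now a __m256d, defined earlier
    _mm256_store_pd(x+4*i,x2); // store aligned back
}

That is, change "_mm_" to "_mm256_", "_ps" to "_pd", and "__m128" to "__m256d". The element stride stays at 4, because a 256-bit register holds 4 doubles just as a 128-bit register holds 4 floats; only the byte stride doubles, from 16 to 32. _mm_prefetch itself needs no AVX counterpart: it takes an address and requests the cache line containing it, regardless of which register width later loads the data. Given that you're consuming twice as many bytes per iteration, though, the prefetch distance might need to be adjusted a bit, and that's a bit of a black art that's best settled with benchmarking...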
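
For benchmarking, it may help to wrap the loop in a function with the prefetch distance as a parameter. The following is only a minimal sketch: the name scale_pd_avx and the dist parameter are made up for illustration, x is assumed to be 32-byte aligned, and n is assumed to be a multiple of 4. The target attribute is GCC/Clang-specific and can be dropped if you compile with -mavx.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: multiplies x[0..n) in place by s, 4 doubles per step.
   Assumes x is 32-byte aligned and n is a multiple of 4.
   dist is the prefetch distance in doubles (a tuning knob, not a known-good value). */
__attribute__((target("avx")))
void scale_pd_avx(double *x, size_t n, double s, size_t dist)
{
    const __m256d x0 = _mm256_set1_pd(s);   /* broadcast s to all 4 lanes */

    for (size_t i = 0; i < n; i += 4) {
        /* Prefetch never faults, but the bounds check keeps the pointer
           arithmetic inside the array. */
        if (i + dist < n)
            _mm_prefetch((const char *)(x + i + dist), _MM_HINT_NTA);

        __m256d x1 = _mm256_load_pd(x + i);  /* aligned load of 4 doubles */
        __m256d x2 = _mm256_mul_pd(x0, x1);
        _mm256_store_pd(x + i, x2);          /* aligned store back */
    }
}
```

Calling scale_pd_avx(x, n, 2.0, 8) prefetches two iterations ahead, as in the loop above. On most current x86 CPUs a cache line is 64 bytes, i.e. 8 doubles, so one prefetch covers two iterations' worth of data; whether any particular dist helps is something only a benchmark on your machine can tell.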