The simdlen value when a processor core has multiple vector pipelines

I am reading the OpenMP 4.5 standard and trying to make sense of the !$omp simd / #pragma omp simd directive. Specifically, it is not clear to me which simdlen values are allowed.

If I have a processor core with one floating point unit (FPU) capable of 256-bit vector operations, I would use simdlen(4) for 64-bit floating point variables.

But what simdlen value should I use if a core has two independent vector pipelines with 128-bit registers?

Solution

tl;dr:

The standard makes no connection between specific hardware architectures and the simdlen clause of the simd construct, so it's implementation defined.

I would first add the question: Do you need to use simdlen at all?

In my experience with different implementations targeting AVX2 and AVX-512, I'd say: no, it is not necessary in order to utilise both VPUs per core on Xeon and Xeon Phi. It can, however, be somewhat beneficial for the performance of the generated code to use a simdlen corresponding to twice the native register size. I think the intended use is a different one (see Background).
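To make the two options concrete, here is a minimal sketch in C. The function names axpy_default and axpy_simdlen8 are made up for illustration, and the value 8 assumes 256-bit registers holding four doubles each (so 8 is "twice the native size"). Whether the hint changes the generated code at all depends on the compiler.

```c
/* Hypothetical example: y[i] += a * x[i] for non-overlapping x and y. */

/* No simdlen: the compiler picks the vector length on its own. */
void axpy_default(int n, double a, const double *x, double *y)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* simdlen(8): a preference for 8 iterations per SIMD chunk, i.e. two
 * 256-bit registers of doubles (assumed hardware). It is only a hint;
 * the implementation is free to use a different length. */
void axpy_simdlen8(int n, double a, const double *x, double *y)
{
    #pragma omp simd simdlen(8)
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Both functions compute the same result; only the generated instructions may differ. With GCC or Clang, something like -fopenmp-simd should be enough to enable the pragmas, and comparing the assembly of the two versions is the most reliable way to see what your implementation actually does with the hint.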

From the standard:

According to the standard (p. 74, l. 22), the simdlen clause for the simd construct (as opposed to the declare simd construct) specifies the preferred behaviour, while the actual behaviour, and thus the answer to the original question, is implementation defined:

If used, the simdlen clause specifies the preferred number of iterations to be executed concurrently. The parameter of the simdlen clause must be a constant positive integer. The number of iterations that are executed concurrently at any given time is implementation defined.

The only constraints for the allowed value stated in the standard are:

The parameter of the safelen clause must be a constant positive integer expression.

If both simdlen and safelen clauses are specified, the value of the simdlen parameter must be less than or equal to the value of the safelen parameter.
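As an illustration of the second constraint, here is a small sketch in C. The function name shift_add and the dependence distance of 16 are assumptions chosen for the example; safelen(8) is a deliberately conservative bound for that distance.

```c
/* Hypothetical example: each element depends on the one 16 positions
 * earlier, so iterations that are 16 or more apart must not run in the
 * same SIMD chunk. */
void shift_add(int n, double *a)
{
    /* safelen(8): correctness bound, kept well below the distance of 16.
     * simdlen(4): preferred chunk size; it must not exceed safelen. */
    #pragma omp simd safelen(8) simdlen(4)
    for (int i = 16; i < n; ++i)
        a[i] = a[i - 16] + 1.0;
}
```

Swapping the two values, i.e. safelen(4) simdlen(8), would violate the restriction quoted above.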

Background:

The simdlen clause was added to the simd construct (see Section 2.8.1 on page 72) to support specification of the exact number of iterations desired per SIMD chunk.

This can be used to call a matching SIMD-function generated with the declare simd construct and a corresponding simdlen clause, where the latter has slightly different semantics:

If a SIMD version is created, the number of concurrent arguments for the function is determined by the simdlen clause.
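A minimal sketch of that pairing in C could look like the following. The names scale_shift and apply are hypothetical, and the value 4 is just an example; whether the compiler really calls the 4-wide variant from the loop is up to the implementation.

```c
/* Ask for a SIMD variant of the function that takes 4 arguments at once.
 * notinbranch states that it is never called from inside a conditional
 * within a SIMD loop. */
#pragma omp declare simd simdlen(4) notinbranch
double scale_shift(double v)
{
    return 2.0 * v + 1.0;
}

/* Request 4 iterations per SIMD chunk so that the loop matches the
 * 4-wide variant declared above. */
void apply(int n, const double *in, double *out)
{
    #pragma omp simd simdlen(4)
    for (int i = 0; i < n; ++i)
        out[i] = scale_shift(in[i]);
}
```

Here the simdlen on declare simd fixes the width of the generated function variant, while the simdlen on the loop is the preference that lets the loop line up with it.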

Hope that helps.