I am reading the OpenMP 4.5 standard and trying to make up my mind about the !$omp simd / #pragma omp simd directive. Specifically, it is not clear to me which simdlen values are allowed.
If I have a processor core with one floating-point unit (FPU) capable of 256-bit vector operations, I would use simdlen(4) for 64-bit floating-point variables. But what simdlen value should I use if a core has two independent vector pipelines with 128-bit registers?
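The following is a minimal sketch of the reasoning in the question. It assumes a hypothetical target with 256-bit vector registers and 64-bit doubles. The macro and function names are illustrative only. The lane count is computed as a constant expression, since OpenMP pragmas in C are subject to macro replacement, and that value is then passed to simdlen:

```c
/* Hypothetical target: one FPU with 256-bit vector registers. */
#define VECTOR_BITS 256
/* Number of 64-bit double lanes per register: 256 / 64 = 4. */
#define DOUBLE_LANES (VECTOR_BITS / 64)

/* y = a*x + y, requesting DOUBLE_LANES iterations per SIMD chunk.
   Without OpenMP support the pragma is ignored and the loop runs serially. */
void daxpy_simdlen(int n, double a, const double *x, double *y)
{
    #pragma omp simd simdlen(DOUBLE_LANES)
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```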
tl;dr:
The standard makes no connection between specific hardware architectures and the simdlen clause of the simd construct, so the number of iterations actually executed concurrently is implementation defined.
I would first ask a different question: do you need to use simdlen at all?
From my experience with different implementations on AVX2 and AVX-512, I'd say no: it is not necessary in order to utilise both VPUs per core on Xeon and Xeon Phi. However, passing twice the number of elements that fit in a native register (e.g. simdlen(8) for doubles with AVX2) can somewhat improve the performance of the generated code. I think the intended use is a different one (see Background).
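The following sketch illustrates the two options. It assumes an AVX2 target, where a 256-bit register holds 4 doubles. The first function leaves the chunk size to the compiler. The second requests twice that number of iterations via simdlen(8). The function names are hypothetical, and whether the wider chunk is actually faster depends on the compiler, as described above.

```c
/* Sum of squares with no simdlen: the implementation picks the chunk size. */
double sumsq_default(int n, const double *x)
{
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += x[i] * x[i];
    return s;
}

/* Same loop, assuming AVX2 (4 doubles per 256-bit register):
   simdlen(8) asks for two registers' worth of iterations per chunk. */
double sumsq_wide(int n, const double *x)
{
    double s = 0.0;
    #pragma omp simd simdlen(8) reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += x[i] * x[i];
    return s;
}
```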
From the standard:
According to the standard (p. 74, l. 22), the simdlen clause of the simd construct (as opposed to the declare simd construct) specifies only the preferred behaviour, while the actual behaviour, and thus the answer to the original question, is implementation defined:
If used, the simdlen clause specifies the preferred number of iterations to be executed concurrently. The parameter of the simdlen clause must be a constant positive integer. The number of iterations that are executed concurrently at any given time is implementation defined.
The only constraints on the allowed value stated in the standard are:
The parameter of the safelen clause must be a constant positive integer expression.
If both simdlen and safelen clauses are specified, the value of the simdlen parameter must be less than or equal to the value of the safelen parameter.
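Here is a sketch of a loop that satisfies both constraints. The function name is hypothetical. Each iteration reads a value written 8 iterations earlier, so at most 8 consecutive iterations may run concurrently (safelen(8)). Under the rule above, simdlen must then be at most 8. Writing simdlen(16) together with safelen(8) would be non-conforming.

```c
/* Dependence distance is 8, so safelen(8) is safe; simdlen(4) <= 8
   satisfies the simdlen/safelen restriction. */
void recurrence(int n, double *a)
{
    #pragma omp simd safelen(8) simdlen(4)
    for (int i = 8; i < n; ++i)
        a[i] = a[i - 8] + 1.0;
}
```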
Background:
The simdlen clause was added to the simd construct (see Section 2.8.1 on page 72) to support specification of the exact number of iterations desired per SIMD chunk.
This can be used to call a matching SIMD function generated with the declare simd construct and a corresponding simdlen clause, where the latter has slightly different semantics:
If a SIMD version is created, the number of concurrent arguments for the function is determined by the simdlen clause.
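Here is a minimal sketch of that pairing. The function names are hypothetical. The scale() function is declared with declare simd simdlen(4), so a SIMD version, if created, takes 4 concurrent arguments. The calling loop requests the same chunk size. Whether the SIMD version is actually called remains up to the implementation.

```c
/* Request a SIMD version of scale() that handles 4 values of v at once;
   factor is the same for all lanes, hence uniform. */
#pragma omp declare simd simdlen(4) uniform(factor)
double scale(double v, double factor)
{
    return v * factor;
}

void scale_all(int n, double factor, const double *in, double *out)
{
    /* Matching simdlen(4) so the 4-wide SIMD version of scale() can be used. */
    #pragma omp simd simdlen(4)
    for (int i = 0; i < n; ++i)
        out[i] = scale(in[i], factor);
}
```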
Hope that helps.