I have a 4D vector: std::array<float, 4>
I want to check if all it's components are inside the value range: 0.0f <= X && X < 256.0f
How do I check if any of the vector components are outside of this range? All I need is a single bool to know if that whole vector passes or fails the test.
My first attempt to solve this was the following piece of code:
bool Check_If_Outside_2(std::array<float, 4> vec)
{
bool outside = false;
for (int i = 0; i < 4; i++)
if (vec[i] < VType(0) || vec[i] >= VType(256))
outside = true;
return outside;
}
Which resulted in this assembler output:
Check_If_Outside_2(std::array<float, 4ul>): # @Check_If_Outside_2(std::array<float, 4ul>)
vxorps xmm2, xmm2, xmm2
vucomiss xmm0, xmm2
setb al
vmovss xmm3, dword ptr [rip + .LCPI7_0] # xmm3 = mem[0],zero,zero,zero
vucomiss xmm0, xmm3
setae cl
or cl, al
vmovshdup xmm0, xmm0 # xmm0 = xmm0[1,1,3,3]
vucomiss xmm0, xmm2
setb dl
vucomiss xmm0, xmm3
setae al
or al, dl
or al, cl
vxorps xmm0, xmm0, xmm0
vcmpltps xmm0, xmm1, xmm0
vpermilps xmm0, xmm0, 212 # xmm0 = xmm0[0,1,1,3]
vmovaps xmm2, xmmword ptr [rip + .LCPI7_1] # xmm2 = <2.56E+2,2.56E+2,u,u>
vcmpleps xmm1, xmm2, xmm1
vpermilps xmm1, xmm1, 212 # xmm1 = xmm1[0,1,1,3]
vorps xmm0, xmm0, xmm1
vpsllq xmm0, xmm0, 63
vmovmskpd ecx, xmm0
or al, cl
shr cl
or al, cl
and al, 1
ret
Then I tried the following optimized version, which uses the idea, that negative integer values fill the top bits in the register. So I simply convert the floats into integers and test the high bits if any of them are non-zero. If there are non-zero bits, the vector's component is either negative or above the legal maximum range.
template<typename T, unsigned long N>
static inline std::array<int32_t, N> To_Int_Vec(const std::array<T, N>& x)
{
std::array<int32_t, N> int_vec;
for (int i = 0; i < N; ++i)
int_vec[i] = floor(x[i]);
return int_vec;
}
bool Check_If_Outside(std::array<float, 4> vec)
{
constexpr int32_t neg_mask = ~255;
auto vec_int = To_Int_Vec(vec);
bool outside = false;
for (int i = 0; i < 4; i++)
if (vec_int[i] & neg_mask)
outside = true;
return outside;
}
Which gave the following assembler output:
Check_If_Outside(std::array<float, 4ul>): # @Check_If_Outside(std::array<float, 4ul>)
vshufps xmm0, xmm0, xmm1, 65 # xmm0 = xmm0[1,0],xmm1[0,1]
vroundps xmm0, xmm0, 9
vcvttps2dq xmm0, xmm0
vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
vpor xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 229 # xmm1 = xmm0[1,1,2,3]
vpor xmm0, xmm0, xmm1
vmovd eax, xmm0
cmp eax, 255
seta al
ret
I would assume that this kind of test should still be possible to optimize much further using Intel SIMD instructions, but I'm not sure how to do it. Is it possible to handle it any better in pure C++ or are intrinsics required, or even inline assembler? Or is it even possible to optimize it any further?
To see the optimized output, I'm using x86-64 clang 11.0.0, with compiler flags: -O3 -mtune=skylake -ffast-math -funsafe-math-optimizations -fno-math-errno -msse4.1 -mavx -mfma4
EDIT: Issue solved with your help! Thank you!
Using SIMD directly even seems simpler than the original code:
bool Check_If_Outside(std::array<float, 4> vec)
{
__m128 v = _mm_loadu_ps(vec.data());
__m128 tooHigh = _mm_cmpge_ps(v, _mm_set1_ps(256));
return _mm_movemask_ps(_mm_or_ps(v, tooHigh));
}
Resulting code (at least the way I compiled it):
Check_If_Outside(std::array<float, 4ul>): # @Check_If_Outside(std::array<float, 4ul>)
vmovlhps xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
vbroadcastss xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = [2.56E+2,2.56E+2,2.56E+2,2.56E+2]
vcmpleps xmm1, xmm1, xmm0
vorps xmm0, xmm0, xmm1
vmovmskps eax, xmm0
test eax, eax
setne al
ret
So we skip the conversion to integers, and instead of doing a horizontal-OR we rely on vmovmskps
.
Another idea, based on comparing the floats by their raw bit pattern,
bool Check_If_Outside(std::array<float, 4> vec)
{
__m128i v = _mm_loadu_si128((__m128i*)vec.data());
__m128i bits256 = _mm_set1_epi32(0x43800000);
__m128i cmp = _mm_cmpeq_epi32(_mm_min_epu32(v, bits256), bits256);
return _mm_movemask_ps(_mm_castsi128_ps(cmp));
}
Either way I am blatantly ignoring the existence of negative zero, and of NaN but -ffast-math
already ignores NaN anyway.
uiCA results for Skylake, if I pretend that the input comes from memory (instead of shuffling them together from registers - to break any accidental dependency):
In all cases this is only some rough indication. The code here is analyzed by itself, not in any real context, but throughput is context-dependent (it affects eg execution port pressure). Also, the setcc
should disappear in most real usages (where this function is inlined into the caller and the condition is immediately branched on). And shuffling the function arguments together into one vector (or loading the input from memory, which I used to do the uiCA analysis) would be replaced by whatever actually produces the input.