Search code examples
c++roundingsimdsseunsigned

simd: round up (ceil) the log2 of an input, while clamping negative logs to zero?


is there any simd that can ceil a float (round up) and than cast it to unsigned int without wrap? (i.e. any negative number become 0)?

maybe _mm_cvttps_epi32? not sure about rounding and wrap though (I think i got ub):

__m128 indices = _mm_mul_ps(phaseIncrement.v, mMinTopFreq.v);
indices = sse_mathfun_log_ps(indices) / std::log(2.0f);
__m128i finalIndices = _mm_cvttps_epi32(indices);

or any fancy way? I'm compiling with -O3 -march=nehalem -funsafe-math-optimizations -fno-omit-frame-pointer (so sse4.1 are allowed).

EDIT here's a version of the revisited code, after Peter Cordes's suggestions:

__m128 indexesFP = _mm_mul_ps(phaseIncrement.v, mMinTopFreq.v);
indexesFP = sse_mathfun_log_ps(indexesFP) * (1.0f / std::log(2.0f));
indexesFP = _mm_ceil_ps(indexesFP);
__m128i indexes = _mm_cvttps_epi32(indexesFP);
indexes = _mm_max_epi32(indexes, _mm_set_epi32(0, 0, 0, 0));
for (int i = 0; i < 4; i++) {
    int waveTableIndex = _mm_extract_epi32(indexes, i);
    waveTablesData[i] = &mWaveTables[waveTableIndex];
}

any improvements that can be made?


Solution

  • since the range will be limited (such as [0, 16] in the extreme case

    Oh, this doesn't need to work for numbers greater than INT_MAX, up to UINT_MAX? That's much easier than the problem as stated at the top of the question. Yeah just _mm_ceil_ps and use a signed conversion to epi32 (int32_t) and use _mm_min_epi32 for the upper limit, and probably _mm_max_epi32 for the lower limit. (Only one instruction instead of shift/and).

    Or possibly _mm_sub_ps to range-shift to -16..0 / _mm_cvttps_epi32 to truncate (upwards towards zero), then integer subtract from zero. _mm_ceil_ps costs 2 uops on most CPUs, so that's about break-even, trading an FP operation for integer though. But requires more setup.

    Integer min/max are cheaper (lower latency, and better throughput) than FP, so prefer clamp after conversion. Out-of-range floats convert to INT_MIN (high-bit set, others zero, what Intel calls the "integer indefinite" value) so will clamp to 0.


    If you have a lot of this to do in a loop that doesn't do other FP computation, change the MXCSR rounding mode for this loop to round towards +Inf. Use _mm_cvtps_epi32 (which uses the current FP rounding mode, like lrint / (int)nearbyint) instead of ceil + cvtt (truncation).


    This use-case: ceil(log2(float))

    You could just pull that out of the FP bit pattern directly, and round up based on a non-zero mantissa. Binary floating point already contains a power-of-2 exponent field, so you just need to extract that with a bit of massaging.

    Like _mm_and_ps / _mm_cmpeq_epi32 / _mm_add_epi32 to add the -1 compare result for FP values with a zero mantissa, so you treat powers of 2 differently from anything higher.

    Should be faster than computing an FP log base e with a fractional part, even if it's only a quick approximation. Values smaller than 1.0 whose biased exponent is negative may need some extra handling.

    Also, since you want all four indices, probably faster to just store to an array of 4 uint32_t values and access it, instead of using movd + 3x pextrd.

    An even better way to round up to the next exponent for floats with a non-zero mantissa is to simply do an integer add of 0x007fffff to the bit-pattern. (23 set bits: https://en.wikipedia.org/wiki/Single-precision_floating-point_format).

    // we round up the exponent separately from unbiasing.
    // Might be possible to do better
    __m128i ceil_log2_not_fully_optimized(__m128 v)
    {
        // round up to the next power of 2 (exponent value) by carry-out from mantissa field into exponent
        __m128i floatbits = _mm_add_epi32(_mm_castps_si128(v), _mm_set1_epi32(0x007fffff));    
    
        __m128i exp = _mm_srai_epi32(floatbits, 23);   // arithmetic shift so negative numbers stay negative
        exp = _mm_sub_epi32(exp, _mm_set1_epi32(127));  // undo the bias
        exp = _mm_max_epi32(exp, _mm_setzero_si128());  // clamp negative numbers to zero.
        return exp;
    }
    

    If the exponent field was already all-ones, that means +Inf with an all-zero mantissa, else NaN. So it's only possible for carry propagation from the first add to flip the sign bit if the input was already NaN. +Inf gets treated as one exponent higher than FLT_MAX. 0.0 and 0.01 should both come out to 0, if I got this right.

    According to GCC on Godbolt, I think so: https://godbolt.org/z/9G9orWj16 GCC doesn't fully constant-propagate through it, so we can actually see the input to pmaxsd, and see that 0.0 and 0.01 come out to max(0, -127) and max(0,-3) = 0 each. And 3.0 and 4.0 both come out to max(0, 2) = 2.


    We can maybe even combine that +0x7ff... idea with adding a negative number to the exponent field to undo the bias.

    Or to get carry-out into the sign bit correct, subtracting from it, with a 1 in the mantissa field so an all-zero mantissa will propagate a borrow and subtract one more from the exponent field? But small exponents less than the bias could still carry/borrow out and flip the sign bit. But that might be ok if we're going to clamp such values to zero anyway if they come out as small positive instead of negative.

    I haven't worked out the details of this yet; if we need to handle the original input being zero, this might be a problem. If we can assume the original sign bit was cleared, but log(x) might be negative (i.e. exponent field below the bias), this should work just fine; carry out of the exponent field into the sign bit is exactly what we want in that case, so srai keeps it negative and max chooses the 0.

        // round up to the next power of 2 (exponent value) while undoing the bias
        const uint32_t f32_unbias = ((-127)<<23) + 0x007fffffU;
        ???
        profit