Search code examples
c++algorithmrustbit-manipulationmicro-optimization

Is there a faster algorithm for max(ctz(x), ctz(y))?


For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))?

ctz represents "count trailing zeros".

C++ version (Compiler Explorer)

#include <algorithm>
#include <bit>
#include <cstdint>

int32_t test2(uint64_t x, uint64_t y) {
    return std::max(std::countr_zero(x), std::countr_zero(y));
}

Rust version (Compiler Explorer)

pub fn test2(x: u64, y: u64) -> u32 {
    x.trailing_zeros().max(y.trailing_zeros())
}

Solution

  • These are equivalent:

    • max(ctz(a),ctz(b))
    • ctz((a|-a)&(b|-b))
    • ctz(a)+ctz(b)-ctz(a|b)

    The math-identity ctz(a)+ctz(b)-ctz(a|b) requires 6 CPU instructions, parallelizable to 3 steps on a 3-way superscalar CPU:

    • 3× ctz
    • 1× bitwise-or
    • 1× addition
    • 1× subtraction

    The bit-mashing ctz((a|-a)&(b|-b)) requires 6 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:

    • 2× negation
    • 2× bitwise-or
    • 1× bitwize-and
    • 1× ctz

    The naïve max(ctz(a),ctz(b)) requires 5 CPU instructions, parallelizable to 4 steps on a 2-way superscalar CPU:

    • 2× ctz
    • 1× comparison
    • 1× conditional branch
    • 1× load/move (so that the "output" is always in the same register)

    ... but note that branch instructions can be very expensive.

    If your CPU has a conditional load/move instruction, this reduces to 4 CPU instructions taking 3 super-scalar steps.

    If your CPU has a max instruction (e.g. SSE4), this reduces to 3 CPU instructions taking 2 super-scalar steps.

    All that said, the opportunities for super-scalar operation depend on which instructions you're trying to put against each other. Typically you get the most by putting different instructions in parallel, since they use different parts of the CPU (all at once). Typically there will be more "add" and "bitwise or" units than "ctz" units, so doing multiple ctz instructions may actually be the limiting factor, especially for the "math-identity" version.

    If "compare and branch" is too expensive, you can make a non-branching "max" in 4 CPU instructions. Assuming A and B are positive integers:

    1. C = A-B
    2. subtract the previous carry, plus D, from D itself (D is now either 0 or -1, regardless of whatever value it previously held)
    3. C &= D (C is now min(0, A-B))
    4. A -= C (A' is now max(A,B))