Search code examples
loopsoptimizationclang

Simple loop optimized by clang, how?


From CppCon 2014: Mike Acton "Data-Oriented Design and C++", he shows this simple loop

int Foo::Bar(int count)
{
    int value = 0;
    for (int i=0;i<count;i++)
    {
        if ( m_NeedParentUpdate )
        {
            value++;
        }
    }
    return (value);
}

Being optimized by clang as such

enter image description here

I don't understand what is happening here. Why is this code bad, and why is it optimized as such by clang, why does this work?

About the loop, he also says "I'm sure you can optimize this in your head". I don't get it. How do I optimize this?


Solution

  • The loop increments value count times if m_NeedParentUpdate is not false.

    From the generated code, it seems m_NeedParentUpdate is a boolean value stored as an unsigned byte at offset 0 from this. The optimizer probably detects that m_NeedParentUpdate is a constant in the loop, so the test can be moved outside the loop. The programmer should have written the code this way already and it might what Mike Acton refers to by I'm sure you can optimize this in your head.

    Here is a rewritten version:

    class Foo {
        bool m_NeedParentUpdate;
        int Bar(int count);
    };
    
    int Foo::Bar(int count) {
        int value = 0;
        if (m_NeedParentUpdate) {
            for (int i = 0; i < count; i++) {
                value++;
            }
        }
        return value;
    }
    

    Note however that further optimizing the code in your head might lead to reducing the loop to value += count;, but this would be incorrect for negative values of count, which is not so obvious at first sight.

    The optimizer may detect the loop pattern and optimize as:

    int Foo::Bar(int count) {
        int value = 0;
        if (m_NeedParentUpdate) {
            if (count >= 0) {
                value += count;
            }
        }
        return value;
    }
    

    Or equivalently:

    int Foo::Bar(int count) {
        int value = 0;
        if (count >= 0) {
            if (m_NeedParentUpdate) {
                value = count;
            }
        }
        return value;
    }
    

    Converting m_ParentNeedUpdate to an unsigned and negating it produces 0 for false and all bits one for true. Masking count with this value will produce 0 or count.

    int Foo::Bar(int count) {
        int value = 0;
        if (count >= 0) {
            value = -(unsigned)m_NeedParentUpdate & count;
        }
        return value;
    }
    

    Note however that the code still has a test and a branch instruction. It could be further optimized as:

    int Foo::Bar(int count) {
        // equivalent code, but definitely not readable
        return -(unsigned)m_NeedParentUpdate & -(unsigned)(count >= 0) & count;
    }
    

    Compiling this with both gcc and clang produces branchless code as can be seen on GodBolt's Compiler Explorer. But neither compiler reduces the original code to this one liner.