Tags: algorithm, performance, optimization, bit-manipulation, micro-optimization

Can compilers ever optimize variables to use less than a byte of space?


I'm thinking about really granular code optimization for a case where enumerated types are being stored in an array or hash table.

I've got an array of Animal enums, where the enum has 4 different categories it can represent: Cat, Dog, Fish, and Bird (is there a name for the categories of an enumerated type, by the way?). I need to check whether every value in a certain range is the same. Here's the unoptimized way of doing this:

func same([]Animal data, int start, int end) -> Animal or None:
    if (end - start < 0 || end > data.end || start < data.start):  // Bounds checks
        return None
    else if (start == end):                                        // Trivial case
        return data[start]
    else:                                                          // Meat & potatoes
        first_value = data[start]
        relevant_data = data[start:end]                                // Slice of relevant parts
        for (value in relevant_data):                                  // Iterate over relevant data
            if value != first_value:                                       // Check if same
                return None                                                // Reject on first non-match
        return first_value                                             // Accept on base case

Now this is fine; it's O(n) time complexity for the worst and average case, but it involves that pesky if on every iteration, which I think risks branch mispredictions at the hardware level. The more elegant way of doing this is to store the data differently, so that instead of it being implicitly stored as something like this:

Animal = Enum(
    Cat       // 0b00
    Dog       // 0b01
    Fish      // 0b10
    Bird      // 0b11
)

We can make data instead be stored as something like this:

Animal = SuperSpecialEnum(
    Cat       // 0b0001
    Dog       // 0b0010
    Fish      // 0b0100
    Bird      // 0b1000
)

Then, we can use this code instead:

func same([]Animal data, int start, int end) -> Animal or None:
    if (end - start < 0 || end > data.end || start < data.start):  // Bounds checks
        return None
    else if (start == end):                                        // Trivial case
        return data[start]
    else:                                                          // Thanksgiving
        first_value = data[start]
        relevant_data = data[start:end]                                // Slice of relevant parts

        for (value in relevant_data):                                  // Iterate over relevant data
            first_value &= value                                           // Bitwise and check

        if (first_value == 0b0000):
            return None                                                // Reject on non-match case
        else:
            return first_value                                         // Accept on base case

Now, because of that bitwise first_value &= value, we can avoid branching entirely. Sure, we give up the early-reject case, which would have saved us, well, that's actually a bit hard to say; I think you'd need to consider the binomial distribution over your overall probability that any given value is different. But the benefit is that you've entirely eliminated branching, and many modern compilers and architectures support 128-bit AND operations, so this operation could be really, really fast. You're still O(n) time complexity for the worst and average case, but you're potentially cutting down the number of iterations by 16x (16-value ANDs with 128-bit boolean arithmetic, assuming the compiler knows what it's doing, and it usually does), and completely eliminating the risk of branch mispredictions.
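
To make the idea concrete, here's a minimal compilable sketch of that branchless loop in C++ (the std::span/std::optional signature and the names are mine, not part of the pseudocode above); with one-hot values stored one per byte, a compiler can typically auto-vectorize the AND into wide SIMD instructions:

#include <cstdint>
#include <optional>
#include <span>

enum class Animal : std::uint8_t { Cat = 1 << 0, Dog = 1 << 1, Fish = 1 << 2, Bird = 1 << 3 };

// Branchless "is every element the same?" over one-hot values: AND
// everything together. Any two different one-hot encodings share no
// bits, so a single mismatch anywhere clears the accumulator to 0.
std::optional<Animal> same(std::span<const Animal> data) {
    if (data.empty())
        return std::nullopt;
    std::uint8_t acc = static_cast<std::uint8_t>(data[0]);
    for (Animal a : data)                 // no data-dependent branch in the loop body
        acc &= static_cast<std::uint8_t>(a);
    if (acc == 0)
        return std::nullopt;              // reject on non-match case
    return static_cast<Animal>(acc);      // accept on base case
}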

Now the real question is twofold, though both questions are really different applications of the same question of whether the compiler optimizes sub-byte values to use less space. One, would you be using 2x as much space to store data, because the redefined enum uses one bit per value instead of log2(values) bits, assuming a constant enum category count (still interested in whether there's a proper name for those categories, by the way)? Two, could you potentially get a 32x speedup if the compiler knows you're only using the first 4 bits of each Animal, thereby allowing 32-value ANDs with 128-bit boolean arithmetic?
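
As a quick sanity check on the space half of the question: in C++ you can at least pick the underlying type explicitly (this static_assert sketch is mine, not from the question), but even then each array element still occupies a whole byte, so sub-byte packing never happens on its own:

#include <cstdint>

enum class Animal : std::uint8_t { Cat = 1 << 0, Dog = 1 << 1, Fish = 1 << 2, Bird = 1 << 3 };

static_assert(sizeof(Animal) == 1);           // one whole byte, even though only 4 bits are used
static_assert(sizeof(Animal[1000]) == 1000);  // not 500: no two-nibbles-per-byte packing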


Solution

  • Most architectures only support 128-bit AND via SIMD, and in that case they usually also support compare-for-equality on packed bytes / int16 / int32. e.g. x86 pcmpeqb/w/d/q.

    You can AND together some compare results and check at the end (of a cache line or whole array) that every element of your SIMD vector had a "true", i.e. that every element of the array matched the first element.

    You could do this on packed 2-bit fields, after doing a bit-broadcast somehow to duplicate the first 2-bit field to all other pairs in a byte, and byte-broadcast that to a whole SIMD vector (like AVX2 vpbroadcastb, or in C intrinsics _mm_set1_epi8). Compare for equality still works for this case for packed 2-bit or 4-bit fields, although it might take a couple extra front-end uops per 16-byte vector to load + pcmpeqb with another reg + pand, vs. just pand with a memory source operand. Still, allowing twice as dense a representation makes up for that on most CPUs, especially when you consider cutting your cache footprint in half.
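
    A sketch of that compare-and-AND loop with SSE2 intrinsics (the function name is mine; it assumes byte-sized elements and a length that's a nonzero multiple of 16, with tail handling omitted):

        #include <immintrin.h>
        #include <cstddef>
        #include <cstdint>

        // Broadcast the first byte, compare 16 bytes at a time (pcmpeqb),
        // AND the compare results together (pand), and only at the very
        // end check that every lane matched.
        bool all_same_bytes(const std::uint8_t* data, std::size_t len) {
            const __m128i first = _mm_set1_epi8(static_cast<char>(data[0]));
            __m128i acc = _mm_set1_epi8(static_cast<char>(0xFF));      // start all-true
            for (std::size_t i = 0; i < len; i += 16) {
                __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
                acc = _mm_and_si128(acc, _mm_cmpeq_epi8(v, first));    // 0xFF per matching byte
            }
            return _mm_movemask_epi8(acc) == 0xFFFF;                   // every byte compared equal
        }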


    I don't think any compilers will do this enum renumbering for you, or packing of an array of enums to store two nibble elements per byte.

    Definitely not any current C compilers; the language (and ABIs for the language) define too much about type widths and data layouts, and it's usually not worth the compiler's time to look for the rare cases when a data structure is completely private to a function and totally restructure how the types work.

    (Moreover, it's usually not worth compiler writers' time to try to write code that can safely / correctly do major high-level transformations that make data layout in memory different from what the source says. Optimizing away an array entirely is certainly done, but changing its type isn't something I've seen C compilers do.)

    But sure, it may be possible in theory, especially in languages unlike C where enum doesn't define how things are auto-numbered. (In C that is well-defined: start at zero and increment by 1, unless you override it like enum foo { a = 1<<0, b = 1<<1, ... };)

    Ahead-of-time compiled languages will be targeting an ABI that's defined for the platform (e.g. the x86-64 System V ABI when compiling for x86-64 GNU/Linux). This lets code from different compilers, and different versions / optimization settings of the same compiler, call each other.

    Having an enum as a function arg or return value makes it part of the ABI, for non-static functions (i.e. ones that could be called from separately-compiled code). So other than link-time optimization (and inlining), compilers don't have a choice about data representations across non-inline-function boundaries. (Except sometimes with inter-procedural optimizations, sometimes enabled by link-time optimization across source files.)

    Also keep in mind that C compilers usually care about being useful for low-level systems programming, including interacting with hardware or being able to memory-map files. If that can happen, then data representations become externally visible. That's a reason why compilers wouldn't want to consider doing data packing that changes how an array is stored. It's hard to prove (outside of a local whose address never escapes a function) that nothing else might care about the data layout of an array.


    Atomicity / thread safety for modifying adjacent array elements

    C/C++ guarantee that different threads can modify different objects without disturbing each other (since C11 / C++11 introduced a thread-aware memory model).

    This includes adjacent elements of a char array[] or enum foo array[]. Each array element is a separate object (as well as being part of the whole array object). See C++ memory model and race conditions on char arrays, and Can modern x86 hardware not store a single byte to memory? (it can, and so can every ISA with byte-store instructions).

    In a single-threaded language, or one without a guarantee like this, yes you could in theory have an implementation that packed enums into sub-byte fields.

    Fun fact: in C++, std::vector<bool> is required to be a packed bitmap template specialization (a useful data structure, but an unfortunate historical choice of class to expose it through). This makes it unsafe for different threads to be doing vbool[1] = false and vbool[2] = false at the same time, unlike any other std::vector, where that is safe.
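
    A small illustration of the contrast (a sketch; the names are illustrative):

        #include <cstdint>
        #include <thread>
        #include <vector>

        int main() {
            std::vector<std::uint8_t> bytes(16, 0);
            std::vector<bool> bits(16, false);   // required to be a packed bitmap

            // Safe: bytes[1] and bytes[2] are distinct objects, and the
            // C11/C++11 memory model guarantees these stores don't
            // disturb each other.
            std::thread a([&] { bytes[1] = 1; });
            std::thread b([&] { bytes[2] = 1; });
            a.join();
            b.join();

            // NOT safe: bits[1] and bits[2] share a byte, so each
            // assignment is a non-atomic read-modify-write of that byte,
            // i.e. a data race if done from two threads:
            //   std::thread c([&] { bits[1] = true; });
            //   std::thread d([&] { bits[2] = true; });
            bits[1] = true;                      // fine single-threaded
        }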


    If you want this optimization, you generally have to write it in the source yourself.
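
    For example, a hand-rolled packed container along these lines (a sketch; the class and its interface are illustrative, and it glosses over odd element counts):

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Two 4-bit one-hot values per byte: the layout change the
        // compiler won't make for you.
        class PackedAnimals {
            std::vector<std::uint8_t> bytes_;
        public:
            explicit PackedAnimals(std::size_t n) : bytes_((n + 1) / 2, 0) {}

            void set(std::size_t i, std::uint8_t nibble) {   // nibble: 0b0001 .. 0b1000
                std::uint8_t& b = bytes_[i / 2];
                const int shift = (i % 2) * 4;
                b = static_cast<std::uint8_t>((b & ~(0xF << shift)) | (nibble << shift));
            }

            std::uint8_t get(std::size_t i) const {
                return static_cast<std::uint8_t>((bytes_[i / 2] >> ((i % 2) * 4)) & 0xF);
            }

            // Branchless check over the packed storage: AND all bytes
            // together. If every nibble holds the same one-hot value v,
            // the result is (v << 4) | v; any mismatch zeroes a nibble.
            // (Assumes an even, fully-populated element count.)
            std::uint8_t and_reduce() const {
                std::uint8_t acc = 0xFF;
                for (std::uint8_t b : bytes_)
                    acc &= b;
                return acc;
            }
        };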