Tags: c++, performance, optimization

Fast implementation of set product cardinality


I need a fast implementation of set product cardinality, e.g. using performance tricks to reduce branching and memory accesses.

In C++, the signature would look something like:

enum cardinality { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };

cardinality product(cardinality a, cardinality b) {
    
}

This table enumerates the expected result of this function:

   0  1  ?  *  +
0  0  0  0  0  0
1  0  1  ?  *  +
?  0  ?  ?  *  *
*  0  *  *  *  *
+  0  +  *  *  +

Ideas

  • Use a 5x5 lookup table
  • Sort the two inputs (the product is commutative) and use a smaller triangular (15-entry) lookup table
  • Use a tree of intelligently selected branches
  • Smart bit representations and bit twiddling

For personal reference, this is the solution I arrived at with help from here and ChatGPT:

enum cardinality {
    zero         = 0b001, // 0
    one          = 0b000, // 1
    at_most_one  = 0b100, // ?
    many         = 0b110, // * 
    at_least_one = 0b010  // +
};

cardinality product(cardinality a, cardinality b) {
    int m = a | b;
    int z = m & 0b001;  // nonzero iff either factor is zero
    // A zero factor forces the product to zero (0b001); otherwise
    // OR-ing the two encodings already yields the correct result.
    return (cardinality)((m & -(1 - z)) | z);
}

On Godbolt this compiles to a short, branch-free sequence.

Solution

  • I could think of two implementations:

    1. The obvious 2d table lookup
    2. Lookup into a 64-entry table indexed by 6 bits. Since we only need 3 bits to store a single cardinality value, shift/OR two of them together to form a 6-bit index for the lookup.
    #include <cstdint>
    #include <iostream>
    
    enum cardinality : uint8_t { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };
    
    cardinality table[25] = {
        ZERO, ZERO,         ZERO,        ZERO, ZERO,
        ZERO, ONE,          AT_MOST_ONE, ANY,  AT_LEAST_ONE,
        ZERO, AT_MOST_ONE,  AT_MOST_ONE, ANY,  ANY,
        ZERO, ANY,          ANY,         ANY,  ANY,
        ZERO, AT_LEAST_ONE, ANY,         ANY,  AT_LEAST_ONE
    };
    
    // Table for all combinations of 6 bits
    cardinality table2[1 << 6];
    
    // Use a simple 2d table as lookup
    cardinality product(cardinality a, cardinality b)
    {
        return table[uint8_t(a) + uint8_t(b) * 5];
    }
    
    // Use the generated 6-bit table. Not all values in the table are populated.
    cardinality product2(cardinality a, cardinality b)
    {
        return table2[uint8_t(a) | (uint8_t(b) << 3)];
    }
    
    int main(int, char*[])
    {
        // Populate the 6-bit table. Could also just be hard-coded. Not all
        // entries are used, since we only need 25 values but 6 bits address 64.
        for (uint8_t a = 0; a < 5; a++)
            for (uint8_t b = 0; b < 5; b++)
                table2[a | (b << 3)] = product(cardinality(a), cardinality(b));
    
        // Read the operands as ints: operator>> on a uint8_t extracts a single
        // character, not a number.
        int a, b;
        std::cin >> a >> b;
        if (!std::cin || a < 0 || a > 4 || b < 0 || b > 4)
            return 1;
    
        std::cout << int(product(cardinality(a), cardinality(b))) << ' '
                  << int(product2(cardinality(a), cardinality(b))) << '\n';
        return 0;
    }
    

    I don't know if these are the fastest, but the generated assembly looks quite minimal. Compiler Explorer link: https://godbolt.org/z/7q4zf441b.

    The difference in the generated assembly (-O3, Clang 15) is that version 1 uses a lea/add while version 2 uses a shl/or:

    Version 1:

    product(cardinality, cardinality):
            mov     eax, edi
            mov     ecx, esi
            lea     rcx, [rcx + 4*rcx]
            add     rcx, rax
            lea     rax, [rip + table]
            movzx   eax, byte ptr [rax + rcx]
            ret
    

    Version 2:

    product2(cardinality, cardinality):
            mov     eax, edi
            mov     ecx, esi
            shl     rcx, 3
            or      rcx, rax
            lea     rax, [rip + table2]
            movzx   eax, byte ptr [rcx + rax]
            ret
    

    I don't think any version based on a tree of branches is going to beat these. Maybe there is some bit twiddling that would be faster. If you are calling this often enough to optimize at this level, then I assume the table would stay cached and the memory lookup would be cheap. If you are not calling it often and the table is not cached, then there is probably no need to spend much time optimizing this.
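
    If you do want to measure it, here is a minimal timing sketch for the 2d-table version. The LCG operand stream and hand-rolled timing are illustrative assumptions; a real measurement would use a proper benchmark harness to defeat over-eager optimization:

    ```cpp
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum cardinality : uint8_t { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };

    static const cardinality table[25] = {
        ZERO, ZERO,         ZERO,        ZERO, ZERO,
        ZERO, ONE,          AT_MOST_ONE, ANY,  AT_LEAST_ONE,
        ZERO, AT_MOST_ONE,  AT_MOST_ONE, ANY,  ANY,
        ZERO, ANY,          ANY,         ANY,  ANY,
        ZERO, AT_LEAST_ONE, ANY,         ANY,  AT_LEAST_ONE
    };

    cardinality product(cardinality a, cardinality b) {
        return table[a + b * 5];
    }

    int main() {
        // Pseudo-random operands in [0, 5) from a simple LCG.
        std::vector<uint8_t> xs(1 << 20);
        uint32_t s = 12345;
        for (auto& x : xs) { s = s * 1664525u + 1013904223u; x = (s >> 16) % 5; }

        auto t0 = std::chrono::steady_clock::now();
        unsigned sink = 0;  // accumulate results so the loop isn't elided
        for (size_t i = 0; i + 1 < xs.size(); i += 2)
            sink += product(cardinality(xs[i]), cardinality(xs[i + 1]));
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.2f ns/op (sink=%u)\n", ns / (xs.size() / 2), sink);
    }
    ```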