Tags: c++, performance, optimization

Fast implementation of set product cardinality


I need a fast implementation of set product cardinality, e.g. using performance tricks to reduce branching and memory accesses.

In C++, the signature would look something like:

enum cardinality { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };

cardinality product(cardinality a, cardinality b) {
    
}

This table enumerates the expected result of this function:

   0  1  ?  *  +
0  0  0  0  0  0
1  0  1  ?  *  +
?  0  ?  ?  *  *
*  0  *  *  *  *
+  0  +  *  *  +

Ideas

  • Use a 5x5 lookup table
  • Sort the two inputs (the product is commutative) and use a smaller triangular (15-entry) lookup table
  • Use a tree of intelligently selected branches
  • Smart bit representations and bit twiddling

For personal reference, this is the solution I arrived at with help from here and ChatGPT:

enum cardinality {
    zero         = 0b001, // 0
    one          = 0b000, // 1
    at_most_one  = 0b100, // ?
    many         = 0b110, // * 
    at_least_one = 0b010  // +
};

cardinality product(cardinality a, cardinality b) {
    int m = a | b;
    int z = m & 0b001;  // nonzero iff either factor is zero
    // A zero factor forces the product to zero (0b001); otherwise
    // OR-ing the two encodings already yields the correct result.
    return (cardinality)((m & -(1 - z)) | z);
}

On Godbolt this compiles to a short, branch-free sequence.

Solution

  • I could think of two implementations:

    1. The obvious 2d table lookup
    2. Lookup into a 64-entry table indexed by 6 bits. Since we only need 3 bits to store a single cardinality value, shift/OR two of them together to form a 6-bit index for the lookup.
    #include <cstdint>
    #include <iostream>
    
    enum cardinality : uint8_t { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };
    
    cardinality table[25] = {
        ZERO, ZERO,         ZERO,        ZERO, ZERO,
        ZERO, ONE,          AT_MOST_ONE, ANY,  AT_LEAST_ONE,
        ZERO, AT_MOST_ONE,  AT_MOST_ONE, ANY,  ANY,
        ZERO, ANY,          ANY,         ANY,  ANY,
        ZERO, AT_LEAST_ONE, ANY,         ANY,  AT_LEAST_ONE
    };
    
    // Table for all combinations of 6 bits
    cardinality table2[1 << 6];
    
    // Use a simple 2d table as lookup
    cardinality product(cardinality a, cardinality b)
    {
        return table[uint8_t(a) + uint8_t(b) * 5];
    }
    
    // Use the generated 6-bit table. Not all values in the table are populated.
    cardinality product2(cardinality a, cardinality b)
    {
        return table2[uint8_t(a) | (uint8_t(b) << 3)];
    }
    
    int main(int, char*[])
    {
        // Populate the 6-bit table. Could also just be hard-coded. Not all
        // entries are used, since we only need 25 values but 6 bits address 64.
        for (uint8_t a = 0; a < 5; a++)
            for (uint8_t b = 0; b < 5; b++)
                table2[a | (b << 3)] = product(cardinality(a), cardinality(b));
    
        // Read the operands as ints: operator>> on a uint8_t extracts a single
        // character, not a number.
        int a, b;
        std::cin >> a >> b;
        if (!std::cin || a < 0 || a > 4 || b < 0 || b > 4)
            return 1;
    
        std::cout << int(product(cardinality(a), cardinality(b))) << ' '
                  << int(product2(cardinality(a), cardinality(b))) << '\n';
        return 0;
    }
    

    I don't know if these are the fastest, but the generated assembly looks quite minimal. Compiler Explorer link: https://godbolt.org/z/7q4zf441b.

    The difference in the generated assembly (-O3, Clang 15) is that version 1 uses a lea/add while version 2 uses a shl/or:

    Version 1:

    product(cardinality, cardinality):
            mov     eax, edi
            mov     ecx, esi
            lea     rcx, [rcx + 4*rcx]
            add     rcx, rax
            lea     rax, [rip + table]
            movzx   eax, byte ptr [rax + rcx]
            ret
    

    Version 2:

    product2(cardinality, cardinality):
            mov     eax, edi
            mov     ecx, esi
            shl     rcx, 3
            or      rcx, rax
            lea     rax, [rip + table2]
            movzx   eax, byte ptr [rcx + rax]
            ret
    

    I don't think any version based on a tree of branches is going to beat these. Maybe there is some bit twiddling that would be faster. If you are calling this often enough to optimize at this level, then I assume the table would stay cached and the memory lookup would be cheap. If you are not calling it often and the table is not cached, then there is probably no need to spend much time optimizing this.
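
    If you do want to measure it, here is a minimal timing sketch for the 2d-table version. The LCG operand stream and hand-rolled timing are illustrative assumptions; a real measurement would use a proper benchmark harness to defeat over-eager optimization:

    ```cpp
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    enum cardinality : uint8_t { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };

    static const cardinality table[25] = {
        ZERO, ZERO,         ZERO,        ZERO, ZERO,
        ZERO, ONE,          AT_MOST_ONE, ANY,  AT_LEAST_ONE,
        ZERO, AT_MOST_ONE,  AT_MOST_ONE, ANY,  ANY,
        ZERO, ANY,          ANY,         ANY,  ANY,
        ZERO, AT_LEAST_ONE, ANY,         ANY,  AT_LEAST_ONE
    };

    cardinality product(cardinality a, cardinality b) {
        return table[a + b * 5];
    }

    int main() {
        // Pseudo-random operands in [0, 5) from a simple LCG.
        std::vector<uint8_t> xs(1 << 20);
        uint32_t s = 12345;
        for (auto& x : xs) { s = s * 1664525u + 1013904223u; x = (s >> 16) % 5; }

        auto t0 = std::chrono::steady_clock::now();
        unsigned sink = 0;  // accumulate results so the loop isn't elided
        for (size_t i = 0; i + 1 < xs.size(); i += 2)
            sink += product(cardinality(xs[i]), cardinality(xs[i + 1]));
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%.2f ns/op (sink=%u)\n", ns / (xs.size() / 2), sink);
    }
    ```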