I need a fast implementation of set product cardinality, e.g. using performance tricks to reduce branching and memory accesses.
In C++, the signature would look something like:
enum cardinality { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE}
cardinality product(cardinality a, cardinality b) {
}
This table enumerates the expected result of this function:
0 1 ? * +
0 0 0 0 0 0
1 0 1 ? * +
? 0 ? ? * *
* 0 * * * *
+ 0 + * * +
Ideas
For personal reference, this is the solution I arrived at with help from here and ChatGPT:
enum cardinality {
zero = 0b001, // 0
one = 0b000, // 1
at_most_one = 0b100, // ?
many = 0b110, // *
at_least_one = 0b010 // +
};
cardinality product(cardinality a, cardinality b) {
int bit0 = 0b001 & (a | b);
int bit1 = 0b010 & (a | b);
int bit2 = 0b100 & (a | b);
return (cardinality)(bit0 || (bit1 | bit2));
}
Godbolt:
product(cardinality, cardinality):
or edi, esi
xor eax, eax
and edi, 7
setne al
ret
I could think of two implementations:
cardinality
value, shift/or two of them together to generate a 6 bit value for a lookup.#include <cstdint>
#include <array>
#include <iostream>
enum cardinality : uint8_t { ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE };
cardinality table[25] = {
ZERO, ZERO, ZERO, ZERO, ZERO,
ZERO, ONE, AT_MOST_ONE, ANY, AT_LEAST_ONE,
ZERO, AT_MOST_ONE, AT_MOST_ONE, ANY, ANY,
ZERO, ANY, ANY, ANY, ANY,
ZERO, AT_LEAST_ONE, ANY, ANY, AT_LEAST_ONE
};
// Table for all combinations of 6 bits
cardinality table2[1 << 6];
// Use simple 2d table as lookup
cardinality product(cardinality a, cardinality b)
{
return table[uint8_t(a) + (uint8_t(b) * uint8_t(5))];
}
// Use generated 6-bit table. Not all values in table are populated
cardinality product2(cardinality a, cardinality b)
{
return table2[uint8_t(a) | (uint8_t(b) << 3)];
}
int main(int, char*[])
{
// Populate the 6 bit table. Could also just be hard-coded. Not all table entries are populated
// since we only need 25 values but 6 bits stores 64 values.
for (uint8_t a = 0; a < 5; a++)
{
for (uint8_t b = 0; b < 5; b++)
{
table2[a | (b << 3)] = product(cardinality(a), cardinality(b));
}
}
uint8_t a, b;
std::cin >> a;
std::cin >> b;
uint8_t p1 = product(cardinality(a), cardinality(b));
uint8_t p2 = product2(cardinality(a), cardinality(b));
return table[uint8_t(a) + (uint8_t(b) * uint8_t(5))];
}
I don't know if these are the fastest, but the generated assembly looks quite minimal. Compiler Explorer link: https://godbolt.org/z/7q4zf441b.
The difference in the generate assembly (-O3
, Clang 15) is version 1 uses a lea/add
while version 2 uses a shl/or
:
Version 1:
product(cardinality, cardinality):
mov eax, edi
mov ecx, esi
lea rcx, [rcx + 4*rcx]
add rcx, rax
lea rax, [rip + table]
movzx eax, byte ptr [rax + rcx]
ret
Version 2:
product2(cardinality, cardinality):
mov eax, edi
mov ecx, esi
shl rcx, 3
or rcx, rax
lea rax, [rip + table2]
movzx eax, byte ptr [rcx + rax]
ret
I don't think there is any version based on any kind of tree structure that is going to beat these. Maybe there is some bit twiddling that would be faster. If you are calling this often enough to optimize at this level, then I assume the table would be cached and the memory lookup would be pretty cheap. If you are not calling this often and the table is not cached, then there is probably no need to spend much time optimizing this.