I have a loop that takes between 90% and 99% of the program time approximately. It reads a huge LUT, and this loop is executed > 100,000 times, so it deserves some optimization.
EDIT:
The LUT (actually there are various arrays that compose the LUT) is made of arrays of ptrdiff_t
and of unsigned __int128
. They have to be that wide because of the algorithm (especially the 128 bit ones). T_RDY
is the only bool
array.
EDIT:
The LUT stores past combinations used to try to solve a problem that didn't work. There's no relation between them (that I can see yet), so I don't see a more appropriate search pattern.
The single threaded version of the loop is:
k = false;
for (ptrdiff_t i = 0; i < T_IND; i++) {
if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN)) {
k = true;
break;
}
}
With this code, which makes use of OpenMP, I reduced the time between 2x and 3x in a 4 core processor:
k = false;
#pragma omp parallel for shared(k)
for (ptrdiff_t i = 0; i < T_IND; i++) {
if (k)
continue;
if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN))
k = true;
}
EDIT:
Info about the data used:
#define DIM_MAX 128
#define P_LEN prb_lvl[0]
#define P_LVL prb_lvl[1]
#define M_RWS prb_mtx_rws[prb_lvl[1]]
#define T_RWS prb_tab
#define T_NUM prb_tab_num
#define T_RDY prb_tab_rdy
#define T_IND prb_tab_ind
extern ptrdiff_t prb_lvl [2];
extern uint128_t prb_mtx_rws [DIM_MAX];
extern uint128_t prb_tab [10000000];
extern ptrdiff_t prb_tab_num [10000000];
extern bool prb_tab_rdy [10000000];
extern ptrdiff_t prb_tab_ind;
However, the fact that I don't get an improvement of approx. 4x means that it introduces an overhead, which I guess goes from 2x to 1.5x. Part of the overhead is unavoidable (creating and destroying the threads), but there's some new overhead due to the facts that OpenMP doesn't allow to break
from a parallel loop and that I added an if
to each iteration, and I would like to get rid of it if possible.
Is there any other optimization that I could apply? Maybe using pthreads instead.
Should I bother editing some assembly?
I'm using GCC 9 with -O3 -flto (among others).
EDIT:
CPU: i7-5775C
But I plan to use other x64 CPUs with more cores.
You can coalesce k into bit tables and then do comparisons 64 at a time. If an entry in the main tables change, recompute that bit in the bit table.
If different queries use different M_RWS
or P_LVL
or something, then you'd need separate caches for separate search inputs. Or rebuild the cache for their current values, if you do multiple queries between changes. But hopefully that's not the case, otherwise the all-caps names are misleading.
Set up k as a bit table
#define KSZ (10000000/64 + !!(10000000 % 63))
static uint64_t k[KSZ];
void init_k(void){
// We can split this up to minimize cache misses, see below
for (size_t i;i<10000000;++i)
k[i/64] |= (uint64_t)((!!T_RDY[i]) & (!(~T_RWS[i] & M_RWS)) &((T_NUM[i] + P_LVL) <= P_LEN) ) << (i&63);
}
You can find the bit-index into k by searching for a non-zero 64-bit chunk, then using a bitscan to find the bit within that chunk:
size_t k2index(void){
size_t i;
for (i=0; i<KSZ;++i)
if (k[i]) break;
return 64 * i + __builtin_ctzll(k[i]);
}
You may want to split up your data reads so that you get sequential data access (each table is over 40=80MB as described) and don't get a cache miss on every single iteration.
#define KSZ (10000000/64 + !!(10000000%63))
static uint64_t k[KSZ], k0[KSZ], k1[KSZ]; //use calloc instead?
void init_k(void){
//I split these up to minimize cache misses
for (size_t i;i<10000000;++i)
k[i/64] |= (uint64_t)(!!T_RDY[i]) << (i&63);
for (size_t i;i<10000000;++i)
k0[i/64] |= (uint64_t)(!(~T_RWS[i] & M_RWS)) << (i&63);
for (size_t i;i<10000000;++i)
k1[i/64] |= (uint64_t)((T_NUM[i] + P_LVL) <= P_LEN) << (i&63);
//now combine them 64 bits at a time
for (size_t i;i<KSZ;++i)
k[i] &= k0[i];
for (size_t i;i<KSZ;++i)
k[i] &= k1[i];
}
If you split it up like this, you could also initialize (some of) them when you set up your other tables. Or if the tables updated, you could update the k value as well.