c loops pthreads openmp micro-optimization

Micro-optimizing a linear search loop over a huge array with OpenMP: can't break on a hit

I have a loop that takes between 90% and 99% of the program time approximately. It reads a huge LUT, and this loop is executed > 100,000 times, so it deserves some optimization.

EDIT:

The LUT (actually there are various arrays that compose the LUT) is made of arrays of ptrdiff_t and of unsigned __int128. They have to be that wide because of the algorithm (especially the 128 bit ones). T_RDY is the only bool array.

EDIT:

The LUT stores past combinations used to try to solve a problem that didn't work. There's no relation between them (that I can see yet), so I don't see a more appropriate search pattern.

The single threaded version of the loop is:

k   = false;
for (ptrdiff_t i = 0; i < T_IND; i++) {
        if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN)) {
                k = true;
                break;
        }
}

With this code, which makes use of OpenMP, I reduced the time between 2x and 3x in a 4 core processor:

k   = false;
#pragma omp parallel for shared(k)
for (ptrdiff_t i = 0; i < T_IND; i++) {
        if (k)
                continue;
        if (T_RDY[i] && !(~T_RWS[i] & M_RWS) && ((T_NUM[i] + P_LVL) <= P_LEN))
                k = true;
}

EDIT:

Info about the data used:

#define DIM_MAX     128

#define P_LEN       prb_lvl[0]
#define P_LVL       prb_lvl[1]

#define M_RWS       prb_mtx_rws[prb_lvl[1]]

#define T_RWS       prb_tab
#define T_NUM       prb_tab_num
#define T_RDY       prb_tab_rdy
#define T_IND       prb_tab_ind


extern  ptrdiff_t   prb_lvl [2];

extern  uint128_t   prb_mtx_rws [DIM_MAX];

extern  uint128_t   prb_tab [10000000];
extern  ptrdiff_t   prb_tab_num [10000000];
extern  bool        prb_tab_rdy [10000000];
extern  ptrdiff_t   prb_tab_ind;

However, the fact that I don't get an improvement of approx. 4x means that it introduces an overhead, which I guess goes from 2x to 1.5x. Part of the overhead is unavoidable (creating and destroying the threads), but there's some new overhead due to the facts that OpenMP doesn't allow to break from a parallel loop and that I added an if to each iteration, and I would like to get rid of it if possible.

Is there any other optimization that I could apply? Maybe using pthreads instead.

Should I bother editing some assembly?

I'm using GCC 9 with -O3 -flto (among others).

EDIT:

CPU: i7-5775C

But I plan to use other x64 CPUs with more cores.

Solution

You can coalesce k into bit tables and then do comparisons 64 at a time. If an entry in the main tables change, recompute that bit in the bit table.

If different queries use different M_RWS or P_LVL or something, then you'd need separate caches for separate search inputs. Or rebuild the cache for their current values, if you do multiple queries between changes. But hopefully that's not the case, otherwise the all-caps names are misleading.

Set up k as a bit table

#define KSZ (10000000/64 + !!(10000000 % 63))
static uint64_t k[KSZ];

void init_k(void){
  // We can split this up to minimize cache misses, see below
  for (size_t i;i<10000000;++i)
    k[i/64] |= (uint64_t)((!!T_RDY[i]) & (!(~T_RWS[i] & M_RWS)) &((T_NUM[i] + P_LVL) <= P_LEN) ) << (i&63);
}

You can find the bit-index into k by searching for a non-zero 64-bit chunk, then using a bitscan to find the bit within that chunk:

size_t k2index(void){
  size_t i;
  for (i=0; i<KSZ;++i)
    if (k[i]) break;
  return 64 * i + __builtin_ctzll(k[i]);
}

You may want to split up your data reads so that you get sequential data access (each table is over 40=80MB as described) and don't get a cache miss on every single iteration.

#define KSZ (10000000/64 + !!(10000000%63))
static uint64_t k[KSZ], k0[KSZ], k1[KSZ]; //use calloc instead?

void init_k(void){
  //I split these up to minimize cache misses
  for (size_t i;i<10000000;++i)
    k[i/64] |= (uint64_t)(!!T_RDY[i]) << (i&63);
  for (size_t i;i<10000000;++i)
    k0[i/64] |= (uint64_t)(!(~T_RWS[i] & M_RWS)) << (i&63);
  for (size_t i;i<10000000;++i)
    k1[i/64] |= (uint64_t)((T_NUM[i] + P_LVL) <= P_LEN) << (i&63);

  //now combine them 64 bits at a time
  for (size_t i;i<KSZ;++i)
    k[i] &= k0[i];
  for (size_t i;i<KSZ;++i)
    k[i] &= k1[i];
}

If you split it up like this, you could also initialize (some of) them when you set up your other tables. Or if the tables updated, you could update the k value as well.