I have some code in the following github repo: https://github.com/2005m/kit/blob/master/src/dup.c
The loop(s) that I would like to parallelise with OpenMP start at line 244:
for (R_xlen_t i = 0; i < len_i; ++i) {
  R_xlen_t key = 0;
  for (R_xlen_t j = 0; j < len_x; ++j) {
    key ^= HASH(((intptr_t) px[i+j*len_i] & 0xffffffff) ^ 0,K)*97;
  }
  id = HASH(key, K);
  while (h[id]) {
    for (R_xlen_t j = 0; j < len_x; ++j) {
      if (px[h[id]-1+j*len_i] != px[i+j*len_i]) {
        goto labelms1;
      }
    }
    pans[i] = 1; goto labelms2;
    labelms1:;
    id++; id %= M;
  }
  h[id] = (int) i + 1;
  pans[i] = 0;
  count++;
  labelms2:;
}
The above code is part of a function that identifies duplicated rows in a matrix using a hash function.
So far I have tried to add the following:
#pragma omp parallel for num_threads(6) private(j) shared(h, pans) reduction(+:count)
When I run the code multiple times, it is massively faster, but I sometimes get the right answer and sometimes not. I think I need to figure out how to stop two threads from executing the last lines that touch h and pans at the same time.
After searching for some time and testing, it looks like the below code works for me.
#pragma omp parallel for num_threads(4) private(j,id,key) shared(h, pans,px) reduction(+:count)
for (R_xlen_t i = 0; i < len_i; ++i) {
  R_xlen_t key = 0;
  for (R_xlen_t j = 0; j < len_x; ++j) {
    key ^= HASH(((intptr_t) px[i+j*len_i] & 0xffffffff),K)*97;
  }
  id = HASH(key, K);
  while (h[id]) {
    for (R_xlen_t j = 0; j < len_x; ++j) {
      if (px[h[id]-1+j*len_i] != px[i+j*len_i]) {
        goto labelms1;
      }
    }
    pans[i] = 1; goto labelms2;
    labelms1:;
    id++; id %= M;
  }
  #pragma omp critical
  {
    if (h[id] == 0) {
      h[id] = (int) i + 1;
      pans[i] = 0;
      count++;
    } else {
      pans[i] = 0;
    }
  }
  labelms2:;
}
You should refactor the code and split the function into two parts: the first calculates the row hashes, and the second checks for collisions.
Your second problem might be that the hash table uses open addressing. Open addressing doesn't work well when you have lots of collisions, because it degrades the hash table's performance from O(1) to linear O(N), where N is the number of rows. Performance-wise, it would be better to rewrite the hash table to use buckets holding lists of items, either as a dynamic array or a linked list (I've used a dynamic array because it's easy to pass to OpenMP later).
In any case, the part where you calculate hashes should be parallelized:
// trivial to parallelize
#pragma omp parallel for
for (int i = 0; i < len_i; i++)
{
    // calculate_row_hash must be pure/thread-safe
    // (note that row_hash is not a hash table, it's merely
    // a precomputed hash value)
    row_hash[i] = calculate_row_hash(px, i);
}
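For example, calculate_row_hash could simply lift the hashing loop from your question. This is a sketch under my assumptions: HASH, K, len_x and len_i are in scope as in dup.c, and px points at int data.
static R_xlen_t calculate_row_hash(const int * px, R_xlen_t i)
{
    // mirrors the per-row hashing loop from the question;
    // pure function, so it is safe to call from many threads
    R_xlen_t key = 0;
    for (R_xlen_t j = 0; j < len_x; ++j)
        key ^= HASH(((intptr_t) px[i + j*len_i] & 0xffffffff), K) * 97;
    return key;
}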
Once you have the hashes precalculated, you can build the hash table on a single thread. This part is amortized O(N) in complexity, unless the hash table uses open addressing:
// single-threaded insert
for (int i = 0; i < len_i; i++)
{
    // not thread-safe
    hash_table_insert(&ht, row_hash[i], i);
}
hash_table_insert should be a function which inserts the pair (row_hash[i], i) into the hash table, something like this (NOTE: this was written out of my head, not tested at all; it's about the concept):
// (types are declared innermost-first so each one is defined before use)
typedef struct
{
    int hash;
    int value;
}
hashentry_t;

typedef struct
{
    hashentry_t * entries;
    size_t capacity;
    size_t count;
}
hashbucket_t;

// fixed number of buckets for simplicity
typedef struct
{
    hashbucket_t * buckets[HASH_SIZE];
}
hashtable_t;

// quick and dirty chained hash table, no error checking for malloc/realloc
void hash_table_insert(hashtable_t * ht, int hash, int value)
{
    // cast to unsigned so a negative hash can't produce a negative index
    int bucket_idx = (unsigned) hash % HASH_SIZE;
    hashbucket_t * bucket = ht->buckets[bucket_idx];
    if (bucket == NULL) // bucket doesn't exist yet?
    {
        ht->buckets[bucket_idx] = bucket = malloc(sizeof *bucket);
        bucket->count = 0;
        bucket->capacity = 4;
        bucket->entries = malloc(sizeof *bucket->entries * bucket->capacity);
    }
    bucket->count++;
    // check if we need to grow the bucket
    if (bucket->count >= bucket->capacity)
    {
        bucket->capacity *= 2; // grow bucket capacity x2
        bucket->entries = realloc(bucket->entries,
                                  sizeof *bucket->entries * bucket->capacity);
    }
    // add entry to the end of the list (a compound literal needs the type name)
    bucket->entries[bucket->count - 1] = (hashentry_t){ hash, value };
}
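One detail the sketch leaves out: ht.buckets must start out zeroed, and the allocations should be released at the end. A minimal init/free pair could look like this (hypothetical helpers, not part of the sketch above):
#include <stdlib.h>
#include <string.h>

void hash_table_init(hashtable_t * ht)
{
    // every bucket starts out empty (NULL)
    memset(ht->buckets, 0, sizeof ht->buckets);
}

void hash_table_free(hashtable_t * ht)
{
    for (int i = 0; i < HASH_SIZE; i++)
    {
        if (ht->buckets[i] != NULL)
        {
            free(ht->buckets[i]->entries);
            free(ht->buckets[i]);
        }
    }
}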
Once you have the hash-table precalculated, you can search for dupes using something like:
for (int i = 0; i < len_i; i++)
{
    // skip the row if we already checked it
    if (is_duplicate[i])
        continue;
    // find the bucket (same unsigned cast as in hash_table_insert)
    int bucket_idx = (unsigned) row_hash[i] % HASH_SIZE;
    hashbucket_t * bucket = ht.buckets[bucket_idx];
    if (bucket->count == 1) // single entry, certainly not a dupe
        continue;
    // process all entries in parallel
    #pragma omp parallel for
    for (int j = 0; j < (int) bucket->count; j++)
    {
        hashentry_t * entry = &bucket->entries[j];
        int row = entry->value;
        // the current row is obviously in this list;
        // it is not a duplicate of itself
        if (row == i || is_duplicate[row])
            continue;
        // if we have too few buckets, this might be a hash collision,
        // so double-check the actual hash value and skip if it differs
        if (entry->hash != row_hash[i])
            continue;
        // at this point, take the slow path
        // (are_rows_equal must be pure/thread-safe)
        if (are_rows_equal(i, row))
        {
            is_duplicate[row] = true;
            is_duplicate[i] = true;
        }
    }
}
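One caveat I should add: several iterations of the inner parallel loop can store true into the same is_duplicate slot at the same time. All writers store the same value, so in practice it tends to work out, but it is formally a data race; if you want it well defined, use atomic writes:
// inside the inner parallel loop, instead of the plain stores:
#pragma omp atomic write
is_duplicate[row] = true;
#pragma omp atomic write
is_duplicate[i] = true;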
Where are_rows_equal does the actual comparison of row items:
// Returns true if rows with indices 'row_a' and 'row_b'
// contain exactly the same items.
bool are_rows_equal(int row_a, int row_b)
{
    // check individual row items
    for (int j = 0; j < len_x; ++j)
        if (px[row_a + j*len_i] != px[row_b + j*len_i])
            return false;
    // if we're here, rows are equal
    return true;
}
If there are many duplicate rows, are_rows_equal will be heavily parallelized. If there are few duplicate rows, bucket->count will be 1, so the loop will skip most of the iterations. This presumes that you have a good hash function and a large enough hash table.
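As a rough rule of thumb (my assumption, nothing in the sketch above fixes this), make the table at least a couple of times larger than the number of rows; and if you turn HASH_SIZE into a run-time, power-of-two size, the modulo also becomes a cheap mask:
// smallest power of two >= 2 * len_i
// (this assumes the bucket array is allocated with hash_size
// entries at run time instead of the fixed HASH_SIZE above)
size_t hash_size = 1;
while (hash_size < 2 * (size_t) len_i)
    hash_size <<= 1;

// with a power-of-two size, '% hash_size' turns into a mask:
size_t bucket_idx = (unsigned) row_hash[i] & (hash_size - 1);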
This is the general idea: it simply sets a flag (bool is_duplicate[N]) whenever it finds a duplicate, but you should be able to adapt it to your needs.
As an aside, I doubt that the R_xlen_t typedef is needed. And casting the input value for the hash function to intptr_t is quite suspicious, to say the least.