I have some code in the following github repo: https://github.com/2005m/kit/blob/master/src/dup.c
The loop(s) that I would like to parallelise with OpenMP start at line 244:
for (R_xlen_t i = 0; i < len_i; ++i) {
  R_xlen_t key = 0;
  for (R_xlen_t j = 0; j < len_x; ++j) {
    key ^= HASH(((intptr_t) px[i+j*len_i] & 0xffffffff) ^ 0,K)*97;
  }
  id = HASH(key, K);
  while (h[id]) {
    for (R_xlen_t j = 0; j < len_x; ++j) {
      if (px[h[id]-1+j*len_i] != px[i+j*len_i]) {
        goto labelms1;
      }
    }
    pans[i] = 1; goto labelms2;
    labelms1:;
    id++; id %= M;
  }
  h[id] = (int) i + 1;
  pans[i] = 0;
  count++;
  labelms2:;
}
The above code is part of a function that identifies duplicated rows in a matrix using a hash function.
So far I have tried to add the following:
#pragma omp parallel for num_threads(6) private(j) shared(h, pans) reduction(+:count)
When I run the code multiple times, it is massively faster, but I sometimes get the right answer and sometimes not. I think I need to figure out how to stop two threads from executing the last lines that touch h and pans at the same time.
After searching for some time and testing, it looks like the below code works for me.
#pragma omp parallel for num_threads(4) private(j,id,key) shared(h, pans,px) reduction(+:count)
for (R_xlen_t i = 0; i < len_i; ++i) {
  R_xlen_t key = 0;
  for (R_xlen_t j = 0; j < len_x; ++j) {
    key ^= HASH(((intptr_t) px[i+j*len_i] & 0xffffffff),K)*97;
  }
  id = HASH(key, K);
  while (h[id]) {
    for (R_xlen_t j = 0; j < len_x; ++j) {
      if (px[h[id]-1+j*len_i] != px[i+j*len_i]) {
        goto labelms1;
      }
    }
    pans[i] = 1; goto labelms2;
    labelms1:;
    id++; id %= M;
  }
  #pragma omp critical
  {
    if (h[id] == 0) {
      h[id] = (int) i + 1;
      pans[i] = 0;
      count++;
    } else {
      pans[i] = 0;
    }
  }
  labelms2:;
}
You should refactor the code and split the function into two parts: the first calculates the row hashes, and the second checks for collisions.
Your second problem might be that the hash table uses open addressing. Open addressing doesn't work well when you have lots of collisions, because it degrades the hash table's performance from O(1) to linear O(N), where N is the number of rows. Performance-wise, it would be better to rewrite the hash table to use buckets holding lists of items, either as a dynamic array or a linked list (I've used a dynamic array because it's easy to pass to OpenMP later).
In any case, the part where you calculate hashes should be parallelized:
// trivial to parallelize
#pragma omp parallel for
for (int i = 0; i < len_i; i++)
{
    // calculate_row_hash must be pure/thread-safe
    // (note that row_hash is not a hash table, it's merely
    // a precomputed hash value)
    row_hash[i] = calculate_row_hash(px, i);
}
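For example, calculate_row_hash could simply lift the hashing loop from your question. This is a sketch under my assumptions: HASH, K, len_x and len_i are in scope as in dup.c, and px points at int data.
static R_xlen_t calculate_row_hash(const int * px, R_xlen_t i)
{
    // mirrors the per-row hashing loop from the question;
    // pure function, so it is safe to call from many threads
    R_xlen_t key = 0;
    for (R_xlen_t j = 0; j < len_x; ++j)
        key ^= HASH(((intptr_t) px[i + j*len_i] & 0xffffffff), K) * 97;
    return key;
}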
Once you have the hashes precalculated, you can build the hash table on a single thread. This part is amortized O(N) in complexity, unless the hash table uses open addressing:
// single-threaded insert
for (int i = 0; i < len_i; i++)
{
    // not thread-safe
    hash_table_insert(&ht, row_hash[i], i);
}
hash_table_insert should be a function which inserts the pair (row_hash[i], i) into the hash table, something like this (NOTE: this was written out of my head, not tested at all; it's about the concept):
// (types are declared innermost-first so each one is defined before use)
typedef struct
{
    int hash;
    int value;
}
hashentry_t;

typedef struct
{
    hashentry_t * entries;
    size_t capacity;
    size_t count;
}
hashbucket_t;

// fixed number of buckets for simplicity
typedef struct
{
    hashbucket_t * buckets[HASH_SIZE];
}
hashtable_t;

// quick and dirty chained hash table, no error checking for malloc/realloc
void hash_table_insert(hashtable_t * ht, int hash, int value)
{
    // cast to unsigned so a negative hash can't produce a negative index
    int bucket_idx = (unsigned) hash % HASH_SIZE;
    hashbucket_t * bucket = ht->buckets[bucket_idx];
    if (bucket == NULL) // bucket doesn't exist yet?
    {
        ht->buckets[bucket_idx] = bucket = malloc(sizeof *bucket);
        bucket->count = 0;
        bucket->capacity = 4;
        bucket->entries = malloc(sizeof *bucket->entries * bucket->capacity);
    }
    bucket->count++;
    // check if we need to grow the bucket
    if (bucket->count >= bucket->capacity)
    {
        bucket->capacity *= 2; // grow bucket capacity x2
        bucket->entries = realloc(bucket->entries,
                                  sizeof *bucket->entries * bucket->capacity);
    }
    // add entry to the end of the list (a compound literal needs the type name)
    bucket->entries[bucket->count - 1] = (hashentry_t){ hash, value };
}
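One detail the sketch leaves out: ht.buckets must start out zeroed, and the allocations should be released at the end. A minimal init/free pair could look like this (hypothetical helpers, not part of the sketch above):
#include <stdlib.h>
#include <string.h>

void hash_table_init(hashtable_t * ht)
{
    // every bucket starts out empty (NULL)
    memset(ht->buckets, 0, sizeof ht->buckets);
}

void hash_table_free(hashtable_t * ht)
{
    for (int i = 0; i < HASH_SIZE; i++)
    {
        if (ht->buckets[i] != NULL)
        {
            free(ht->buckets[i]->entries);
            free(ht->buckets[i]);
        }
    }
}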
Once you have the hash-table precalculated, you can search for dupes using something like:
for (int i = 0; i < len_i; i++)
{
    // skip the row if we already checked it
    if (is_duplicate[i])
        continue;
    // find the bucket (same unsigned cast as in hash_table_insert)
    int bucket_idx = (unsigned) row_hash[i] % HASH_SIZE;
    hashbucket_t * bucket = ht.buckets[bucket_idx];
    if (bucket->count == 1) // single entry, certainly not a dupe
        continue;
    // process all entries in parallel
    #pragma omp parallel for
    for (int j = 0; j < (int) bucket->count; j++)
    {
        hashentry_t * entry = &bucket->entries[j];
        int row = entry->value;
        // the current row is obviously in this list;
        // it is not a duplicate of itself
        if (row == i || is_duplicate[row])
            continue;
        // if we have too few buckets, this might be a hash collision,
        // so double-check the actual hash value and skip if it differs
        if (entry->hash != row_hash[i])
            continue;
        // at this point, take the slow path
        // (are_rows_equal must be pure/thread-safe)
        if (are_rows_equal(i, row))
        {
            is_duplicate[row] = true;
            is_duplicate[i] = true;
        }
    }
}
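One caveat I should add: several iterations of the inner parallel loop can store true into the same is_duplicate slot at the same time. All writers store the same value, so in practice it tends to work out, but it is formally a data race; if you want it well defined, use atomic writes:
// inside the inner parallel loop, instead of the plain stores:
#pragma omp atomic write
is_duplicate[row] = true;
#pragma omp atomic write
is_duplicate[i] = true;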
Where are_rows_equal does the actual comparison of row items:
// Returns true if rows with indices 'row_a' and 'row_b'
// contain exactly the same items.
bool are_rows_equal(int row_a, int row_b)
{
    // check individual row items
    for (int j = 0; j < len_x; ++j)
        if (px[row_a + j*len_i] != px[row_b + j*len_i])
            return false;
    // if we're here, rows are equal
    return true;
}
If there are many duplicate rows, are_rows_equal will be heavily parallelized. If there are few duplicate rows, bucket->count will be 1, so the loop will skip most of the iterations. This presumes that you have a good hash function and a large enough hash table.
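As a rough rule of thumb (my assumption, nothing in the sketch above fixes this), make the table at least a couple of times larger than the number of rows; and if you turn HASH_SIZE into a run-time, power-of-two size, the modulo also becomes a cheap mask:
// smallest power of two >= 2 * len_i
// (this assumes the bucket array is allocated with hash_size
// entries at run time instead of the fixed HASH_SIZE above)
size_t hash_size = 1;
while (hash_size < 2 * (size_t) len_i)
    hash_size <<= 1;

// with a power-of-two size, '% hash_size' turns into a mask:
size_t bucket_idx = (unsigned) row_hash[i] & (hash_size - 1);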
This is the general idea: it simply sets a flag (bool is_duplicate[N]) whenever it finds a duplicate, but you should be able to adapt it to your needs.
As an aside, I doubt that the R_xlen_t typedef is needed. And casting the input value for the hash function to intptr_t is quite suspicious, to say the least.