This is the approach I'm using for parallelization:
n/N
numbers, ranging from index p*n/N
to (p+1)*n/N-1
.Using C++/OpenMP for parallelization. Before calling the function, I set the maximum number of threads with omp_set_num_threads()
.
// p is thread ID, N is total number of threads, n is the array size
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : (p * n / N)
#define UPPER_BOUND(p, N, n) p == (N - 1) ? (n - 1) : (p + 1) * (n / N) - 1
size_t parallel_soe(std::vector<std::atomic<bool>>& A) {
size_t n = A.size();
size_t sqrt_n = sqrt(n);
for (size_t i = 2; i <= sqrt_n; i++) {
if (A[i] == true) continue;
#pragma omp parallel
{
uint16_t p = omp_get_thread_num();
uint16_t N = omp_get_num_threads();
size_t lower_bound = LOWER_BOUND(p, N, n);
size_t upper_bound = UPPER_BOUND(p, N, n);
size_t smallest_multiple = std::max(i * i, lower_bound);
size_t remainder = smallest_multiple % i;
if (remainder) smallest_multiple += i - remainder;
for (size_t j = smallest_multiple; j <= upper_bound; j += i)
A[j] = true;
}
}
// Count number of primes
size_t prime_count = 0;
#pragma omp parallel
{
uint16_t p = omp_get_thread_num();
uint16_t N = omp_get_num_threads();
size_t lower_bound = LOWER_BOUND(p, N, n);
size_t upper_bound = UPPER_BOUND(p, N, n);
size_t count = 0;
for (size_t i = lower_bound; i <= upper_bound; i++)
if (A[i] == false)
count++;
#pragma omp atomic
prime_count += count;
}
return prime_count;
}
The value returned by the function is correct when maximum threads are set to 2, 4, 8, 10, and 16, but not for 6, 12, and 14. I'm running it on a 4 core Intel i5.
This is my output log:
Finding primes under: 10000
================================
[2-parallel] Found 1229 primes in 542 microseconds
[4-parallel] Found 1229 primes in 3173 microseconds
[6-parallel] Found 1228 primes in 2353 microseconds
[8-parallel] Found 1229 primes in 3600 microseconds
[10-parallel] Found 1229 primes in 3227 microseconds
[12-parallel] Found 1227 primes in 2778 microseconds
[14-parallel] Found 1226 primes in 2248 microseconds
[16-parallel] Found 1229 primes in 2320 microseconds
Finding primes under: 100000
================================
[2-parallel] Found 9592 primes in 5186 microseconds
[4-parallel] Found 9592 primes in 10351 microseconds
[6-parallel] Found 9591 primes in 9859 microseconds
[8-parallel] Found 9592 primes in 8500 microseconds
[10-parallel] Found 9592 primes in 12294 microseconds
[12-parallel] Found 9591 primes in 8300 microseconds
[14-parallel] Found 9582 primes in 9252 microseconds
[16-parallel] Found 9592 primes in 8557 microseconds
Finding primes under: 1000000
================================
[2-parallel] Found 78498 primes in 36091 microseconds
[4-parallel] Found 78498 primes in 43570 microseconds
[6-parallel] Found 78498 primes in 48176 microseconds
[8-parallel] Found 78498 primes in 44201 microseconds
[10-parallel] Found 78498 primes in 43645 microseconds
[12-parallel] Found 78498 primes in 49175 microseconds
[14-parallel] Found 78494 primes in 47411 microseconds
[16-parallel] Found 78498 primes in 53602 microseconds
I figured it out. It had nothing to do with parallel execution, I just had a calculation error when computing the bounds.
This is the correct formula for computing lower bound:
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : p * (n / N)
Notice the paranthesis around n/n
.