I have this parallel region written in OpenMP:
std::vector<T> sharedResult;
#pragma omp parallel
{
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    #pragma omp critical
    {
        sharedResult.insert(sharedResult.end(), result.begin(), result.end());
    }
    #pragma omp barrier
    #pragma omp for nowait
    for (size_t i = 0; i < sharedResult.size(); i++) {
        foo(sharedResult[i]);
    }
    ...
}
I'm afraid that the #pragma omp barrier is necessary. The reason I think so is that otherwise, when a thread hits the last #pragma omp for, sharedResult.size() at that moment is still not in its final state (which it only reaches once the previous parallel for has finished). Note that unfortunately sharedResult's size is not known in advance.
Unfortunately, I've noticed that this barrier introduces a big overhead: one particular iteration is more expensive than all the others, so every thread has to wait for the thread that executes that iteration. This can be considered load imbalance, but I haven't found any solution for it.
So my question is: is there any way to start the last parallel for without waiting for the previous one to complete, or is there seriously no way to improve this?
I would agree that the barrier is necessary. I see several ways out, with increasing complexity and likely increasing efficiency:
Post a task for each result element:
#pragma omp parallel
{
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    // would prefer a range based loop here, but
    // there seem to be issues with passing references
    // to tasks in certain compilers
    for (size_t i = 0; i < result.size(); i++) {
        #pragma omp task
        foo(result[i]);
    }
    // the implicit barrier at the end of the parallel region
    // also waits for all outstanding tasks
}
You could even post the tasks from within the initial loop. If there are too many tasks, you might get a significant overhead.
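For illustration, here is a minimal sketch of that variant; computeElement is a hypothetical stand-in for whatever computes one element of result:

#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        T element = computeElement(i); // hypothetical: produces one element
        // element is private here, so it becomes firstprivate in the
        // task by default and each task works on its own copy
        #pragma omp task
        foo(element);
    }
    // the implicit barrier at the end of the parallel region
    // waits for the remaining tasks
}

This way foo can start on the first elements while other threads are still producing, without any explicit barrier in between.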
Now this one is trickier - in particular, you need to distinguish between the result queue merely being empty and all threads having completed their first loop.
#include <chrono>
#include <thread>
#include <omp.h>

using namespace std::chrono_literals; // for the 10ms literal below

std::vector<T> sharedResult;
int threadsBusy;
size_t resultIndex = 0;
#pragma omp parallel
{
    #pragma omp single
    threadsBusy = omp_get_num_threads();
    // the implicit barrier of single ensures threadsBusy is set
    // before any thread decrements it below
    std::vector<T> result;
    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        // fill result
    }
    #pragma omp critical
    {
        sharedResult.insert(sharedResult.end(), result.begin(), result.end());
        threadsBusy--;
    }
    do {
        bool hasResult, allThreadsDone;
        // We need a copy here as the vector may be resized
        // and elements may become invalid by insertion
        T myResult;
        #pragma omp critical
        {
            if (resultIndex < sharedResult.size()) {
                hasResult = true;
                myResult = sharedResult[resultIndex];
                resultIndex++;
            } else {
                hasResult = false;
            }
            allThreadsDone = threadsBusy == 0;
        }
        if (hasResult) {
            foo(myResult);
        } else {
            if (allThreadsDone) {
                break;
            }
            // If we just continue here, we will spin on the mutex.
            // Unfortunately there are no condition variables in OpenMP,
            // so instead we go for a quick nap as a compromise.
            // Feel free to tune this accordingly.
            std::this_thread::sleep_for(10ms);
        }
    } while (true);
}
Note: Usually I test the code I post here, but I couldn't due to the lack of a complete example.
Finally, you could run parallel for loops multiple times over those results that are already done. However, that has a number of issues. First, each worksharing region must be encountered by all threads, even by the ones that complete the first loop late, so you would have to keep track of the loops you run. Also, the loop bound needs to be the same for each thread, and you must only read sharedResult.size() in a critical section. So you have to read it beforehand into a shared variable, by one thread in a critical section, while all the other threads wait until it is properly read (see the sketch below). Further, you would have to use dynamic scheduling; otherwise you would likely use static scheduling and end up waiting on the threads that complete last anyway. Your edited example does neither of these things. I wouldn't take it for granted that a for nowait schedule(dynamic) can complete before all threads in a team have entered it (but it works with libgomp). All things considered, I wouldn't really go there.
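For concreteness, the bound read described above would have to look roughly like this for each such loop - a sketch only, with bound as an illustrative shared variable that is not part of the original code:

size_t bound; // shared, declared before the parallel region

#pragma omp single
{
    // size() must not race with the concurrent insert() calls
    #pragma omp critical
    bound = sharedResult.size();
}
// the implicit barrier at the end of single makes every thread
// wait here until bound has been set, and all see the same value

And even then, the subsequent loop would still have to copy each element under critical, for the same reason as in the previous snippet.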