I'm trying to get into OpenMP and wrote up a small piece of code to get a feel for what to expect in terms of speedup:
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
void SingleThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    for (int index = 0; index < size; index++)
    {
        totalWeight += weights[index];
    }

    for (int index = 0; index < size; index++)
    {
        weights[index] /= totalWeight;
    }
}
void MultiThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

#pragma omp parallel shared(weights, size, totalWeight) default(none)
    {
        // Each thread accumulates a private partial sum; the reduction
        // combines the partial sums into totalWeight at the end of the loop.
        // clang-format off
        #pragma omp for reduction(+ : totalWeight)
        // clang-format on
        for (int index = 0; index < size; index++)
        {
            totalWeight += weights[index];
        }

        // The implicit barrier at the end of the loop above guarantees that
        // totalWeight is complete before any thread starts normalising.
        #pragma omp for
        for (int index = 0; index < size; index++)
        {
            weights[index] /= totalWeight;
        }
    }
}
float TimeIt(std::function<void(void)> function)
{
    auto startTime = std::chrono::high_resolution_clock::now().time_since_epoch();
    function();
    auto endTime = std::chrono::high_resolution_clock::now().time_since_epoch();

    std::chrono::duration<float> duration = endTime - startTime;
    return duration.count();
}
int main(int argc, char *argv[])
{
    std::vector<float> weights(1 << 24);

    std::srand(std::random_device{}());
    std::generate(weights.begin(), weights.end(), []()
                  { return std::rand() / static_cast<float>(RAND_MAX); });

    for (int size = 1; size <= static_cast<int>(weights.size()); size <<= 1)
    {
        auto singleThreadedDuration = TimeIt(std::bind(SingleThreaded, std::ref(weights), size));
        auto multiThreadedDuration = TimeIt(std::bind(MultiThreaded, std::ref(weights), size));

        std::cout << "Size: " << size << std::endl;
        std::cout << "Speedup: " << singleThreadedDuration / multiThreadedDuration << std::endl;
    }
}
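(As an aside, not part of the benchmark above: the two worksharing loops in MultiThreaded could also be written as two combined parallel-for constructs. A minimal sketch of that variant follows, assuming the same includes as above; the function name MultiThreadedCombined is just illustrative. The trade-off is that it opens two parallel regions instead of one, so it pays the fork/join cost twice.)
void MultiThreadedCombined(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    // Sum with a reduction; each thread keeps a private partial sum.
    #pragma omp parallel for reduction(+ : totalWeight)
    for (int index = 0; index < size; index++)
    {
        totalWeight += weights[index];
    }

    // Normalise; totalWeight is complete once the first region has joined.
    #pragma omp parallel for
    for (int index = 0; index < size; index++)
    {
        weights[index] /= totalWeight;
    }
}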
I compiled and ran the above code with MinGW g++ on Win10 like so:
g++ -O3 -static -fopenmp OpenMP.cpp; ./a.exe
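(The best achievable speedup is bounded by the number of threads the parallel regions actually get. If in doubt, that can be queried from the OpenMP runtime; a minimal sketch, built with the same -fopenmp flag, where the reported count is typically the number of logical cores unless OMP_NUM_THREADS overrides it:)
#include <iostream>
#include <omp.h>

int main()
{
    // Upper bound on the number of threads a parallel region will use.
    std::cout << "Max threads: " << omp_get_max_threads() << std::endl;
}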
The output (see below) shows a maximum speedup of around 4.2 at a vector size of 524288, i.e. the multi-threaded version ran about 4.2 times faster than the single-threaded version at that size.
Size: 1
Speedup: 0.00614035
Size: 2
Speedup: 0.00138696
Size: 4
Speedup: 0.00264201
Size: 8
Speedup: 0.00324149
Size: 16
Speedup: 0.00316957
Size: 32
Speedup: 0.00315457
Size: 64
Speedup: 0.00297177
Size: 128
Speedup: 0.00569801
Size: 256
Speedup: 0.00596125
Size: 512
Speedup: 0.00979021
Size: 1024
Speedup: 0.019943
Size: 2048
Speedup: 0.0317662
Size: 4096
Speedup: 0.181818
Size: 8192
Speedup: 0.133713
Size: 16384
Speedup: 0.216568
Size: 32768
Speedup: 0.566396
Size: 65536
Speedup: 1.10169
Size: 131072
Speedup: 1.99395
Size: 262144
Speedup: 3.4772
Size: 524288
Speedup: 4.20111
Size: 1048576
Speedup: 2.82819
Size: 2097152
Speedup: 3.98878
Size: 4194304
Speedup: 4.00481
Size: 8388608
Speedup: 2.91028
Size: 16777216
Speedup: 3.85507
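(Each size is timed exactly once, so some of the scatter at the larger sizes may simply be measurement noise. A variant of TimeIt that repeats the measurement and keeps the fastest run would smooth that out; a minimal sketch, using steady_clock because high_resolution_clock is not guaranteed to be monotonic, and requiring <limits> in addition to the headers already included:)
float TimeItBest(std::function<void(void)> function, int repetitions = 5)
{
    auto best = std::numeric_limits<float>::max();

    for (int run = 0; run < repetitions; run++)
    {
        auto startTime = std::chrono::steady_clock::now();
        function();
        auto endTime = std::chrono::steady_clock::now();

        std::chrono::duration<float> duration = endTime - startTime;
        best = std::min(best, duration.count());
    }

    return best;
}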
So my questions are: is a maximum speedup of around 4x roughly what I should expect from this kind of code, and is there anything I can do to improve on it?
For the single-threaded baseline, the two loops are just a sum followed by a scale, which well-tuned library routines do very efficiently (e.g. cblas_sasum and cblas_sscal from a good BLAS implementation). It's quite possible that you're leaving a lot of single-thread performance on the table at the moment.
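(For reference, a minimal sketch of what that single-threaded baseline might look like on top of CBLAS, assuming a CBLAS implementation such as OpenBLAS is installed and linked; cblas_sasum sums absolute values, which equals the plain sum here because all weights are non-negative, and the function name SingleThreadedBlas is just illustrative:)
#include <cblas.h>
#include <vector>

void SingleThreadedBlas(std::vector<float> &weights, int size)
{
    // Sum of |weights[i]|; identical to the plain sum for non-negative weights.
    auto totalWeight = cblas_sasum(size, weights.data(), 1);

    // In-place scale of the first `size` elements by 1 / totalWeight.
    cblas_sscal(size, 1.0f / totalWeight, weights.data(), 1);
}
Linking would be something like g++ -O3 OpenMP.cpp -lopenblas, though the exact library name depends on the installation.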