Why does the speedup I get by parallelizing with OpenMP decrease after a certain workload size?

I'm trying to get into OpenMP and wrote up a small piece of code to get a feel for what to expect in terms of speedup:

#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>
#include <random>

void SingleThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

    for (int index = 0; index < size; index++)
    {
        totalWeight += weights[index];
    }

    for (int index = 0; index < size; index++)
    {
        weights[index] /= totalWeight;
    }
}

void MultiThreaded(std::vector<float> &weights, int size)
{
    auto totalWeight = 0.0f;

#pragma omp parallel shared(weights, size, totalWeight) default(none)
    {
        // clang-format off
#pragma omp for reduction(+ : totalWeight)
        // clang-format on
        for (int index = 0; index < size; index++)
        {
            totalWeight += weights[index];
        }

#pragma omp for
        for (int index = 0; index < size; index++)
        {
            weights[index] /= totalWeight;
        }
    }
}

float TimeIt(std::function<void(void)> function)
{
    auto startTime = std::chrono::high_resolution_clock::now().time_since_epoch();
    function();
    auto endTime = std::chrono::high_resolution_clock::now().time_since_epoch();
    std::chrono::duration<float> duration = endTime - startTime;

    return duration.count();
}

int main(int argc, char *argv[])
{
    std::vector<float> weights(1 << 24);
    std::srand(std::random_device{}());
    std::generate(weights.begin(), weights.end(), []()
                  { return std::rand() / static_cast<float>(RAND_MAX); });

    for (int size = 1; size <= weights.size(); size <<= 1)
    {
        auto singleThreadedDuration = TimeIt(std::bind(SingleThreaded, std::ref(weights), size));
        auto multiThreadedDuration = TimeIt(std::bind(MultiThreaded, std::ref(weights), size));

        std::cout << "Size: " << size << std::endl;
        std::cout << "Speed up: " << singleThreadedDuration / multiThreadedDuration << std::endl;
    }
}

I compiled and ran the above code with MinGW g++ on Win10 like so:

g++ -O3 -static -fopenmp OpenMP.cpp; ./a.exe

The output (see below) shows a maximum speedup of around 4.2 at a vector size of 524288. That means that the multi-threaded code ran 4.2 times faster than the single-threaded code for a vector size of 524288.

Size: 1
Speedup: 0.00614035
Size: 2
Speedup: 0.00138696
Size: 4
Speedup: 0.00264201
Size: 8
Speedup: 0.00324149
Size: 16
Speedup: 0.00316957
Size: 32
Speedup: 0.00315457
Size: 64
Speedup: 0.00297177
Size: 128
Speedup: 0.00569801
Size: 256
Speedup: 0.00596125
Size: 512
Speedup: 0.00979021
Size: 1024
Speedup: 0.019943
Size: 2048
Speedup: 0.0317662
Size: 4096
Speedup: 0.181818
Size: 8192
Speedup: 0.133713
Size: 16384
Speedup: 0.216568
Size: 32768
Speedup: 0.566396
Size: 65536
Speedup: 1.10169
Size: 131072
Speedup: 1.99395
Size: 262144
Speedup: 3.4772
Size: 524288
Speedup: 4.20111
Size: 1048576
Speedup: 2.82819
Size: 2097152
Speedup: 3.98878
Size: 4194304
Speedup: 4.00481
Size: 8388608
Speedup: 2.91028
Size: 16777216
Speedup: 3.85507

So my questions are:

Why is the multi-threaded code slower for a smaller vector size? Is it purely because of the overhead of creating the threads and distributing the work or am I doing something wrong?
Why does the speedup I get decrease after a certain size?
What would be the best case speedup I could theoretically achieve on the CPU I used (i7 7700k)?
Does the distinction between physical CPU cores and logical CPU cores matter in terms of speedup?
Did I make any blatant mistakes in my code? Can I improve something?

Solution

I agree with your theory; it's likely the overhead of setting things up.
While the CPU cores on your processor have their own L1 and L2 caches, they all share an 8M L3 cache, and once the vector becomes too big to fit into that L3 cache, there is the risk of the threads mutually evicting each other's pages from the cache.
I assume by "logical core" you mean a hyperthread? Those cannot actually compute in parallel, they can merely "fill in" while the other thread is e.g. blocked waiting for memory. In cache effective, compute bound code, that could limit their potential for parallelism considerably.
I don't know to what extent your compiler vectorizes the code it compiles; I would benchmark the two functions you have against a fully vectorized implementation (e.g. using cblas_sasum and cblas_sscal from a good BLAS implementation). It's quite possible that you're leaving a lot of single thread performance on the table at the moment.