Tags: c++, multithreading, vector, push-back

Multithreaded push_back to std::vector: mutex, enlarge and edit in place, or create a vector for results and push it back?


Let's say we have an std::vector<Element> vec which already holds 200 elements. We then want to add count elements to it, where each new element is created from a random old one (from the first 200):

for (int index = 0; index < count; index++) {
  vec.push_back(Element(vec[someRandomIndex(0, 200)], ...));
}

count can be a large number (let's say 200 as well, for this example), and the constructor of Element isn't trivial, so there is a benefit to creating the new Elements in parallel. I see three ways to do this:

  1. Resize vec and modify the elements in-place
vec.resize(vec.size() + count);
// this code will be split into chunks and each chunk 
// will be processed in a separate thread. No overlaps, so no data race. 
for (int index = 0; index < count; index++) {
  vec[200 + index] = Element(vec[someRandomIndex(0, 200)], ...);
}
  2. Create a vector of size count for the result, fill its elements in chunks, and then append the whole result vector to vec (sketched after this list). This uses extra memory, and moving the result vector into vec is linear in complexity.
  3. Guard vec with an std::mutex and put the push_back() call in a critical section (also sketched below). I like this method the least, as each write has to be preceded by a read that fetches a random old element for the creation of the new one.
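For concreteness, here is roughly what I mean by options 2 and 3. This is only a sketch: Element is a trivial stand-in (the real constructor is expensive), and someRandomIndex is a placeholder helper, not my real code.

#include <cstddef>
#include <iterator>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// trivial stand-in; the real Element constructor is expensive
struct Element { std::size_t payload = 0; };

// placeholder for the random-index helper used in the snippets above
int someRandomIndex(int lo, int hi) {
  static thread_local std::mt19937 gen(std::random_device{}());
  return std::uniform_int_distribution<>(lo, hi - 1)(gen);
}

// option 2: each worker fills its own slice of a results vector,
// then the results are appended to vec with move iterators
void appendViaResultsVector(std::vector<Element>& vec, int count, int numWorkers) {
  std::vector<Element> results(count);
  std::vector<std::thread> threads;
  const int chunk = count / numWorkers;  // assume count % numWorkers == 0
  for (int w = 0; w < numWorkers; w++) {
    threads.emplace_back([&, w] {
      for (int i = w * chunk; i < (w + 1) * chunk; i++)
        results[i] = Element{vec[someRandomIndex(0, 200)].payload};
    });
  }
  for (auto& t : threads) t.join();
  vec.insert(vec.end(), std::make_move_iterator(results.begin()),
             std::make_move_iterator(results.end()));
}

// option 3: serialize push_back with a mutex; the read of the random old
// element must also happen under the lock, because push_back can reallocate
// and would invalidate a concurrent read
void appendUnderMutex(std::vector<Element>& vec, int count, int numWorkers) {
  std::mutex m;
  std::vector<std::thread> threads;
  for (int w = 0; w < numWorkers; w++) {
    threads.emplace_back([&] {
      for (int i = 0; i < count / numWorkers; i++) {
        std::lock_guard<std::mutex> lock(m);
        vec.push_back(Element{vec[someRandomIndex(0, 200)].payload});
      }
    });
  }
  for (auto& t : threads) t.join();
}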

Which technique will cause the least overhead? Is there a common practice for this? The number 200 is only an example and is likely to grow bigger.


Solution

  • Don't try to guess what might be faster; write a benchmark and measure. There are common patterns that often improve performance, e.g. passing by const BigBoy& elem to avoid unnecessary copies, calling vec.reserve(KNOWN_SIZE); to avoid reallocations, memoization, etc. But in your case we are talking about subtle differences in the time needed to create new elements, to access elements already in the vector (potentially at random), and to grow the vector.

    So, write a benchmark, and measure!

    #include <future>
    #include <random>
    #include <string>
    #include <vector>
    
    #include <benchmark/benchmark.h>
    
    // also check with 200 and see the difference
    constexpr int HALF_SIZE = 10000;
    // like you said, string with more than 200 characters
    constexpr int STR_LEN = 1000;
    // when the number of workers exceeds HW concurrency, performance usually degrades
    constexpr int NUM_WORKERS = 4;
    
    struct BigStructure {
      BigStructure() : str{} {}
      BigStructure(size_t num) : str(num, 'c') {}
      std::string str;
    };
    
    static void SerialVersion(benchmark::State &state) {
      std::random_device rd;
      std::mt19937 gen(rd());
      std::uniform_int_distribution<> distrib(0, HALF_SIZE - 1);
      std::vector<BigStructure> vec(HALF_SIZE);
      for (int i = 0; i < HALF_SIZE; i++) {
        // diversify the input a bit, but keep similar size
        if (i%2) vec[i] = BigStructure{STR_LEN};
        else vec[i] = BigStructure{STR_LEN-1};
      }
      vec.resize(2 * HALF_SIZE);
      for (auto _ : state) {
        for (int i = 0; i < HALF_SIZE; i++) {
          // also try a version without randomized indices for comparison;
          // write into the second half, like the threaded version does
          vec[HALF_SIZE + i] = BigStructure{vec[distrib(gen)].str.size()};
        }
        benchmark::DoNotOptimize(vec);
      }
    }
    BENCHMARK(SerialVersion);
    
    static void ThreadVersion(benchmark::State &state) {
      std::vector<BigStructure> vec(HALF_SIZE);
      for (int i = 0; i < HALF_SIZE; i++) {
        if (i%2) vec[i] = BigStructure{STR_LEN};
        else vec[i] = BigStructure{STR_LEN-1};
      }
      vec.resize(2 * HALF_SIZE);
      auto worker = [&vec](int id) {
        // each thread gets its own engine: sharing one std::mt19937 across
        // threads would be a data race
        std::mt19937 gen(std::random_device{}());
        std::uniform_int_distribution<> distrib(0, HALF_SIZE - 1);
        constexpr int section = HALF_SIZE / NUM_WORKERS;
        int idxBegin = section * id;
        int idxEnd = idxBegin + section;
        for (int i = idxBegin; i < idxEnd; i++) {
          vec[HALF_SIZE + i] = BigStructure{vec[distrib(gen)].str.size()};
        }
      };
      for (auto _ : state) {
        std::vector<std::future<void>> workers;
        for (int i = 0; i < NUM_WORKERS; i++) {
          workers.push_back(
              std::async(std::launch::async, [i, &worker]() { worker(i); }));
        }
        // block until every worker is done before the iteration counts as finished
        for (auto &w : workers) w.get();
        benchmark::DoNotOptimize(vec);
      }
    }
    BENCHMARK(ThreadVersion);

    BENCHMARK_MAIN();
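    For reference, assuming Google Benchmark is installed system-wide and the code is saved as bench.cpp (a file name I picked for this example; the BENCHMARK_MAIN(); line above provides main), it can be built and run with something like:

        g++ -O2 -std=c++17 bench.cpp -lbenchmark -lpthread -o bench
        ./bench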
    

    On my PC there is no significant difference, and the threaded version is often slightly slower. If I increase the number of elements in the vector to 1e6, the threaded version becomes 3x faster. If I avoid accessing random vector elements, it is faster still. On your older machine the benchmark might yield different results.
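    Regarding the NUM_WORKERS comment in the code: rather than hard-coding 4, you can size the pool from the hardware. A small sketch (pickNumWorkers is a name I made up for illustration):

        #include <thread>

        // hardware_concurrency() may return 0 when the value is not
        // computable, so fall back to a small default in that case
        int pickNumWorkers() {
          unsigned hw = std::thread::hardware_concurrency();
          return hw == 0 ? 4 : static_cast<int>(hw);
        }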

    Google Benchmark on GitHub: https://github.com/google/benchmark