Parallel version of the `std::generate` performs worse than the sequential one

I'm trying to parallelize some old code using the Execution Policy from the C++ 17. My sample code is below:

#include <cstdlib>
#include <chrono>
#include <iostream>
#include <algorithm>
#include <execution>
#include <vector>

using Clock = std::chrono::high_resolution_clock;
using Duration = std::chrono::duration<double>;

constexpr auto NUM = 100'000'000U;

double func()
{
  return rand();
}

int main()
{
  std::vector<double> v(NUM);
  // ------ feature testing
  std::cout << "__cpp_lib_execution         : " << __cpp_lib_execution << std::endl;
  std::cout << "__cpp_lib_parallel_algorithm: " << __cpp_lib_parallel_algorithm << std::endl;
  // ------ fill the vector with random numbers sequentially
  auto const startTime1 = Clock::now();
  std::generate(std::execution::seq, v.begin(), v.end(), func);
  Duration const elapsed1 = Clock::now() - startTime1;
  std::cout << "std::execution::seq: " << elapsed1.count() << " sec." << std::endl;
  // ------ fill the vector with random numbers in parallel
  auto const startTime2 = Clock::now();
  std::generate(std::execution::par, v.begin(), v.end(), func);
  Duration const elapsed2 = Clock::now() - startTime2;
  std::cout << "std::execution::par: " << elapsed2.count() << " sec." << std::endl;
}

The program output on my Linux desktop:

__cpp_lib_execution         : 201902
__cpp_lib_parallel_algorithm: 201603
std::execution::seq: 0.971162 sec.
std::execution::par: 25.0349 sec.

Why does the parallel version performs 25 times worse than the sequential one?

Compiler: g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0

Solution

The thread-safety of rand is implementation-defined. Which means either:

Your code is wrong in the parallel case, or
It's effectively serial, with a highly contended lock, which would dramatically increase the overhead in the parallel case and get incredibly poor performance.

Based on your results, I'm guessing #2 applies, but it could be both.

Either way, the answer is: rand is a terrible test case for parallelism.