OpenMP quicksort implementation slower than sequential after randomization

I'm tinkering with OpenMP and trying to implement a parallelized version of quicksort.
I have implemented a version that takes as pivot always the first element, a parallelized version of it, a version that randomizes the pivot by choosing the median od three random elements, and a parallelized version it.
What bothers me is the fact that I get a good speed up with the first parallelization, whereas the second (despite being parallelized in the same way) is slower than the sequential counterpart.
In both cases I parallelize only the first invocation of the function, I know that I could parallelize more in depth in the three of recursions, but the point is that I expected to get the same speedup from the two parallelizations.

Here's the code snippet for the "naive" (no randomization) partitioning function:

int partition(vector<int>& v, int p, int q){
  int x = v[p];
  int i = p;
  for(int j = p+1; j <= q; j++){
    if(v[j] <= x){
      i++;
      swap(v[i], v[j]);
    }
  }
  swap(v[i], v[p]);
  return i;
}

This is the randomized partitioning function:

int rand_median(const vector<int>& v, int p, int q){
  int n1 = (rand() % (p - q)) + p;
  int n2 = (rand() % (p - q)) + p;
  int n3 = (rand() % (p - q)) + p;
  if((v[n1] <= v[n2] && v[n1] >= v[n3]) || (v[n1] <= v[n3] && v[n1] >= v[n2])) return n1;
  else if ((v[n2] <= v[n1] && v[n2] >= v[n3]) || (v[n2] <= v[n3] && v[n2] >= v[n1])) return n2;
  else return n3;
}

int rand_partition(vector<int>& v, int p, int q){
  int pivot = rand_median(v,p,q);
  swap(v[p], v[pivot]);
  int x = v[p];
  int i = p;
  for(int j = p+1; j <= q; j++){
    if(v[j] <= x){
      i++;
      swap(v[i], v[j]);
    }
  }
  swap(v[i], v[p]);
  return i;
}

Naive quicksort:

void quicksort(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = partition(v,s,e);
  quicksort(v,s,p-1);
  quicksort(v,p+1,e);
}

Parallelized naive quicksort:

void quick_par(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = partition(v,s,e);
  omp_set_num_threads(2);
#pragma omp parallel sections
  {
    #pragma omp section
    quicksort(v,s,p-1);
    #pragma omp section
    quicksort(v,p+1,e);
  }
}

Randomized quicksort:

void quick_rand(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = rand_partition(v,s,e);
  quick_rand(v,s,p-1);
  quick_rand(v,p+1,e);
}

Parallelized randomized quicksort:

void quick_par_rand(vector<int>& v, int s, int e){
  if(s >= e) return;
  int p = rand_partition(v,s,e);
  omp_set_num_threads(2);

#pragma omp parallel sections
  {
    #pragma omp section
    quick_rand(v,s,p-1);
    #pragma omp section
    quick_rand(v,p+1,e);
  }
}

And here are the results obtained with Google benchmark:

bench_ser       887282457 ns    887038659 ns           10 //naive quicksort
bench_par        10738723 ns     10734826 ns           70 //parallelized naive
bench_rand         613904 ns       613686 ns         1039 //randomized quicksort
bench_par_rand    3249751 ns      3248460 ns          213 //parallelized randomized
bench_sort         106110 ns       106074 ns         5952 //std::sort

As you can see the parallelized randomized version is slower than the sequential one. Here's the pastebin of the whole code I've used.

Solution

The parallel version bench_par_rand is incorrect: it uses the rand which is not thread safe. It results in a race condition. Thus, the results may not be random (a critical point needed for a fast quick-sort) and the code much slower since the threads will constantly try to modify the shared internal state (iterative seed) of the rand function. Consider using the C++11 random number generators if possible (one per thread).

A quick workaround could be to use the thread_local storage and the C++11 random number generators together and rename all rand() by safe_rand(). Here is an example:

thread_local std::uniform_int_distribution<int> distrib(0, RAND_MAX);
thread_local std::mt19937 rdGen;
thread_local auto safe_rand = std::bind(distrib, rdGen);

Using local variables rather than global thread_local (and using a specific uniform_int_distribution rather than modulus) improves performance although this is a bit more tedious to do.

Here are the timings:

Base version:
 - bench_rand: 582499 ns
 - bench_par_rand: 2765300 ns

Fixed version with thread_local:
 - bench_rand: 1109800 ns
 - bench_par_rand: 737799 ns

Fixed version with local variables:
 - bench_rand: 798699 ns
 - bench_par_rand: 572300 ns

The last fixed parallel version is much faster! However, the last fixed sequential version is also slower than before. I think this i due to a slower random generator. Finally, there is no cut-off method in your code (switch from quick-sort to a faster algorithm for small arrays). Thus, the cost of the random generator is largely exhibited.