I'd like to generate a large number, n (i.e., n >= 1,000,000,000), of sorted and uniformly distributed random numbers in C++.
A first and simple approach I considered was to generate n uniformly distributed numbers using an std::uniform_real_distribution<double> and then sort them with std::sort. However, this takes several minutes.
A second and more sophisticated approach was to parallelize the two steps, as in:
template <typename T>
void computeUniformDistribution(std::vector<T>& elements)
{
    #pragma omp parallel
    {
        std::seed_seq seed{distribution_seed, static_cast<size_t>(omp_get_thread_num())};
        std::mt19937 prng = std::mt19937(seed);
        std::uniform_real_distribution<double> uniform_dist(0, std::numeric_limits<T>::max());

        #pragma omp for
        for (size_t i = 0; i < elements.size(); ++i)
        {
            elements[i] = static_cast<T>(uniform_dist(prng));
        }
    }

    std::sort(std::execution::par_unseq, elements.begin(), elements.end());
}
However, even this takes about 30 seconds. Given that the generation of the uniformly distributed numbers takes only about 1.5 seconds, the bottleneck is still the sort phase.
Hence, I'd like to ask the following question: How can I efficiently generate uniformly distributed data in a sorted way?
There are ways to generate samples that are already sorted, but I think that it might be better to generate partially sorted samples.
Divide the output range into k buckets of equal width. The number of samples in each bucket will have multinomial distribution with equal probabilities. The slow method to sample the multinomial distribution is to generate n integers in [0, k). A more efficient method is to draw k Poisson samples with rate n/k conditioned on their sum not exceeding n, then add another n - sum samples using the slow way. Sampling the Poisson distribution is tricky to do perfectly, but when n/k is very large (as it will be here), the Poisson distribution is excellently approximated by rounding a normal distribution with mean and variance n/k. If that's unacceptable, the slow method does parallelize well.
Given the bucket counts, compute the prefix sums to find the bucket boundaries. For each bucket in parallel, generate the given number of samples within the bucketed range and sort them. If we choose n/k well, each bucket will almost certainly fit in L1 cache. For n = 1e9, I think I'd try k = 1e5 or k = 1e6.
Here's a sequential implementation. It's a little unpolished, since we really ought to avoid 2x oversampling the bucket boundaries, which are closed, but I'll leave that to you. I'm not familiar with OMP, but I think you can get a pretty good parallel implementation by adding a pragma to the for loop at the end of SortedUniformSamples.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <numeric>
#include <random>
#include <span>
#include <vector>

template <typename Dist, typename Gen>
void SortedSamples(std::span<double> samples, Dist dist, Gen& gen) {
  for (double& sample : samples) {
    sample = dist(gen);
  }
  std::sort(samples.begin(), samples.end());
}

template <typename Gen>
void ApproxMultinomialSample(std::span<std::size_t> samples, std::size_t n,
                             Gen& gen) {
  double lambda = static_cast<double>(n) / samples.size();
  std::normal_distribution<double> approx_poisson{lambda, std::sqrt(lambda)};
  std::size_t sum;
  do {
    for (std::size_t& sample : samples) {
      sample = std::lrint(approx_poisson(gen));
    }
    sum = std::accumulate(samples.begin(), samples.end(), std::size_t{0});
  } while (sum > n);
  std::uniform_int_distribution<std::size_t> uniform{0, samples.size() - 1};
  for (; sum < n; sum++) {
    samples[uniform(gen)]++;
  }
}

template <typename Gen>
void SortedUniformSamples(std::span<double> samples, Gen& gen) {
  static constexpr std::size_t kTargetBucketSize = 1024;
  if (samples.size() < kTargetBucketSize) {
    SortedSamples(samples, std::uniform_real_distribution<double>{0, 1}, gen);
    return;
  }
  std::size_t num_buckets = samples.size() / kTargetBucketSize;
  std::vector<std::size_t> bucket_counts(num_buckets);
  ApproxMultinomialSample(bucket_counts, samples.size(), gen);
  std::vector<std::size_t> prefix_sums(num_buckets + 1);
  std::partial_sum(bucket_counts.begin(), bucket_counts.end(),
                   ++prefix_sums.begin());
  for (std::size_t i = 0; i < num_buckets; i++) {
    SortedSamples(std::span<double>{&samples[prefix_sums[i]],
                                    &samples[prefix_sums[i + 1]]},
                  std::uniform_real_distribution<double>{
                      static_cast<double>(i) / num_buckets,
                      static_cast<double>(i + 1) / num_buckets},
                  gen);
  }
}

int main() {
  std::vector<double> samples(100000000);
  std::default_random_engine gen;
  SortedUniformSamples(samples, gen);
  if (std::is_sorted(samples.begin(), samples.end())) {
    std::cout << "sorted\n";
  }
}
If your standard library has a high-quality implementation of poisson_distribution, you could also do this:
template <typename Gen>
void MultinomialSample(std::span<std::size_t> samples, std::size_t n,
                       Gen& gen) {
  double lambda = static_cast<double>(n) / samples.size();
  std::poisson_distribution<std::size_t> poisson{lambda};
  std::size_t sum;
  do {
    for (std::size_t& sample : samples) {
      sample = poisson(gen);
    }
    sum = std::accumulate(samples.begin(), samples.end(), std::size_t{0});
  } while (sum > n);
  std::uniform_int_distribution<std::size_t> uniform{0, samples.size() - 1};
  for (; sum < n; sum++) {
    samples[uniform(gen)]++;
  }
}