Tags: c++, optimization, cryptography, tbb, uniform-distribution

Fastest way to bring a range [from, to] of 64-bit integers into pseudo-random order, with same results across all platforms?


Given some interval [a, b] of indices (64-bit unsigned integers), I would like to quickly obtain an array that contains all of these indices ordered according to a uniformly distributed hash function, appearing random but actually being the same on every system, regardless of the C++ implementation used.

The goal is to find highly optimized methods for this. You may use shared-memory parallelism via Intel's oneTBB to improve performance.

Something like

vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    unordered_set<uint64_t> uset;
    for (uint64_t i = from; i <= to; i++)
        uset.insert(i);
    return vector<uint64_t>(uset.begin(), uset.end());
}

would generate the desired results if unordered_set<uint64_t> always used the same, pseudo-randomly distributing hash function on every implementation, neither of which is the case. It would also be an inefficient solution. TBB equivalent:

tbb::concurrent_vector<uint64_t> distributeIndices(uint64_t from, uint64_t to) {
    tbb::concurrent_unordered_set<uint64_t> uset;
    tbb::parallel_for(from, to + 1, [&uset](uint64_t i) {
        uset.insert(i);
    }); // NOTE: This is only to illustrate a parallel loop, sequential insertion is actually faster.
    return tbb::concurrent_vector<uint64_t>(uset.begin(), uset.end());
}

Note that distributeIndices(from, to) should return a random-looking permutation of {from, ..., to}.

  • Merely providing some hash functions is insufficient, and none of the answers at "Generating a deterministic int from another with no duplicates" actually answered that question.
    Consider transform from this answer. Notably, a cyclic distribution is not a pseudo-random distribution:

    1. Sorting {from, ..., to} w.r.t. (uint64_t a, uint64_t b) { return transform(a) < transform(b) }
      • distributeIndices(42, 42+99999999)[0, ..., 999] does not look random at all:
        stackoverflow.com/a/72188615 sorted
    2. Sorting {from, ..., to} w.r.t. (uint64_t a, uint64_t b) { return transform(a) % n < transform(b) % n }
      • distributeIndices(42, 42+99999999)[0, ..., 999] does not look random at all:
        stackoverflow.com/a/72188615 sorted modulo n
    3. Assigning each x in {from, ..., to} to transform(x - from) % n + from
      • distributeIndices(42, 42+99999999) happens to be bijective (since 100000000 and 39293 are coprime), but distributeIndices(42, 42+99999999)[0, ..., 999] does not look random at all: stackoverflow.com/a/72188615 assigned modulo n
      • distributeIndices(42, 42+3929299) is not bijective. It assigns only 100 different elements, cycling with a period of 100:
        stackoverflow.com/a/72188615 assigned modulo n
    4. Assigning each x in {from, ..., to} to transform(x - from) + from
      • distributeIndices(42, 42+99999999) is not bijective, e.g. it assigns 3929375282657 > 42+99999999.
  • In particular, a linear congruential generator is not in general a bijection on an interval [from, to]. But if you can make it one for every such interval, while also hiding its cyclic nature, how?

Consequently, an answer should provide a specific hash function (and why it is fast and uniformly distributed), and how to efficiently utilize it in order to compute distributeIndices(from, to).

Again, it is crucial that distributeIndices(from, to) produces the same result regardless of where it runs and which compiler was used; this must be guaranteed by the C++ standard. But it is fine if, for example, distributeIndices(0,2) assigns a different index to 1 than distributeIndices(0,3) does.

Acceptable return types are std::vector, tbb::concurrent_vector, and dynamic arrays, consisting of elements of type uint64_t.
The function should perform well on ranges that include billions of indices.

[ In case you are curious why this could be useful: Consider different processes on different computing nodes, communicating via the Message Passing Interface, which should not send the actual data (which is big), but only the indices of the data entries they are processing. At the same time, the order in which data is processed should be pseudo-randomized so that the progress speed does not "bounce" much (which it does when processing along the ordered indices). This is essential for reliable predictions of how long the overall computation will take. So every node must know which transformed index refers to which actual index, i.e. every node must compute the same result for distributeIndices(from, to). ]

The fastest correctly working solution wins the accepted answer.

  • No C/C++ code, no acceptable answer.
    (Except when it is a proof that the problem cannot be solved efficiently.)

I will test solutions with GCC 11.3 -O3 on my old i7-3610QM laptop with 8 hardware threads on 100 million indices (i.e. distributeIndices(c, c + 99999999)), and may change the accepted answer when a future answer provides a better-performing solution.

Testing code (run up to 10 times, pick fastest execution):

int main(int argc, char* argv[]) {
    uint64_t c = argc < 3 ? 42 : atoll(argv[1]);
    uint64_t s = argc < 3 ? 99999 : atoll(argv[2]); // 99999999 for performance testing
    for (unsigned i = 0; i < 10; i++) {
        chrono::time_point<chrono::system_clock> startTime = chrono::system_clock::now();
        auto indices = distributeIndices(c, c + s);
        chrono::microseconds dur = chrono::duration_cast<chrono::microseconds>(chrono::system_clock::now() - startTime);
        cout << durationStringMs(dur) << endl;
        // [... some checks ...]
#if 0 // bijectivity check
        set<uint64_t> m = set<uint64_t>(indices.begin(), indices.end());
        cout << "min: " << *m.begin() << " , max: " << *prev(m.end()) << ", #elements: " << m.size() << endl;
#endif
        cout << "required average: " << round((2.0L * c + s) / 2, 2) << endl;
        long double avg = accumulate(indices.begin(), indices.end(), __uint128_t(0)) / static_cast<long double>(indices.size());
        string sa = round(avg, 2);
        cout << "actual average:   " << sa << endl;
        auto printTrendlineHelpers = [](uint64_t minX, string avgX, uint64_t maxX, uint64_t minY, string avgY, uint64_t maxY) {
            cout << "Trendline helpers:" << endl;
            cout << "[max] " << minX << " " << maxY << " " << avgX << " " << maxY << " " << maxX << " " << maxY << endl;
            cout << "[avg] " << minX << " " << avgY << " " << avgX << " " << avgY << " " << maxX << " " << avgY << endl;
            cout << "[min] " << minX << " " << minY << " " << avgX << " " << minY << " " << maxX << " " << minY << endl;
        };
        // Print some plottable data, for e.g. https://www.rapidtables.com/tools/scatter-plot.html
        unsigned plotAmount = 2000;
        auto printPlotData = [&](uint64_t start, uint64_t end) {
            long double rng = static_cast<long double>(end - start);
            long double avg = accumulate(indices.begin() + start, indices.begin() + end, __uint128_t(0)) / rng;
            cout << "\ndistributeIndices(" << c << ", " << c << "+" << s << ")[" << start << ", ..., " << end - 1 << "]: (average " << round(avg, 2) << ")" << endl;
            stringstream ss;
            for (unsigned i = start; i < end; i++)
                ss << i << " " << indices[i] << (i + 1 == end ? "" : " ");
            cout << ss.str() << endl;
            printTrendlineHelpers(start, round(start + rng / 2, 2), end - 1, c, sa, c + s);
        };
        printPlotData(0, plotAmount); // front
        printPlotData(indices.size() / 2 - plotAmount / 2, indices.size() / 2 + plotAmount / 2); // middle
        printPlotData(indices.size() - plotAmount, indices.size()); // back
#if 1 // Print average course
        if (s >= 1000000)
            plotAmount *= 10;
        stringstream ss;
        for (uint64_t start = 0; start < indices.size(); start += plotAmount) {
            uint64_t end = min(start + plotAmount, indices.size());
            uint64_t i = start + (end - start) / 2;
            long double avg = accumulate(indices.begin() + start, indices.begin() + end, __uint128_t(0)) / static_cast<long double>(end - start);
            ss << i << " " << round(avg, 2) << (end == indices.size() ? "" : " ");
        }
        cout << "\nAverage course of distributeIndices(" << c << ", " << c << "+" << s << ") with slices of size " << plotAmount << ":\n" << ss.str() << endl;
        printTrendlineHelpers(c, sa, c + s, c, sa, c + s);
        break;
#endif
    }
    return 0;
}
  • Storage of the result (e.g. via static variables) is obviously not allowed.
  • uint64_t from and uint64_t to cannot be considered constexpr.
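
The testing code above uses two small helpers, durationStringMs and round(long double, int), that are not shown. Plausible stand-ins (assumptions, not the original helpers) could look like this:

// Hypothetical reconstructions of the helpers used by the testing code above;
// they assume <sstream> and <iomanip> are included and using namespace std, as in the harness.
string durationStringMs(chrono::microseconds us) { // e.g. "14482.83 ms"
    stringstream ss;
    ss << fixed << setprecision(2) << us.count() / 1000.0L << " ms";
    return ss.str();
}
string round(long double value, int decimals) { // formats value with the given number of decimals
    stringstream ss;
    ss << fixed << setprecision(decimals) << value;
    return ss.str();
}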

My two (unsuitable) examples would be 14482.83 ms (14 s 482.83 ms) and 186812.68 ms (3 min 6 s 812.68 ms).
The second approach seems terribly slow, but upon closer inspection it is the only one that actually distributes the values on my system:

  • unordered_set<uint64_t> variant:
    e.g. 100000041, 100000040, 100000039, 100000038, 100000037, ... // bad
  • tbb::concurrent_unordered_set<uint64_t> variant:
    e.g. 67108864, 33554432, 16777216, 83886080, 50331648, ... // well-distributed, but not looking random
    • distributeIndices(42, 42+99999999)[0, ..., 999] with polynomial trendline:
      stackoverflow.com/q/76076957 example
      The above distribution looks ordered, not random.

An exemplary distribution suggesting randomness can be obtained from olegarch's answer, as of Apr 30, 2023.

// LCG params from: https://nuclear.llnl.gov/CNP/rng/rngman/node4.html
std::vector<uint64_t> distributeIndices(uint64_t lo, uint64_t hi) {
    uint64_t size = hi - lo + 1;
    std::vector<uint64_t> vec(size);
    for(uint64_t i = 0; i < size; i++)
        vec[i] = i + lo;
    uint64_t rnd = size ^ 0xBabeCafeFeedDad;
    for(uint64_t i = 0; i < size; i++) {
        rnd = rnd * 2862933555777941757ULL + 3037000493;
        uint64_t j = rnd % size;
        uint64_t tmp = vec[i]; vec[i] = vec[j]; vec[j] = tmp;
    }
    return std::move(vec);
}

Note that the solution is still incorrect since it doesn't provide uniform distributions for all ranges, as shown further below. It also does not utilize parallel computing, but it performs well: Computing 100 million indices took 3235.18 ms on my i7-3610QM.

  • front ; distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76077930 front
  • middle ; distributeIndices(42, 42+99999999)[49999000, ..., 50000999] with polynomial trendline:
    stackoverflow.com/a/76077930 middle
  • back ; distributeIndices(42, 42+99999999)[99998000, ..., 99999999] with polynomial trendline:
    stackoverflow.com/a/76077930 back
    The above distribution is looking random, even though the average seems to be a little low and bouncy in the beginning. Its global trend looks as follows.
  • average course of distributeIndices(42, 42+99999999) with polynomial trendline:
    stackoverflow.com/a/76077930 averages
    The polynomial trendline deviates up to 5% from the global average, so the distribution is not uniform.
    And for some ranges, it gets worse:
  • average course of distributeIndices(0, 67108863) with polynomial trendline:
    stackoverflow.com/a/76077930 another averages
  • front ; distributeIndices(0, 67108863)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76077930 another front
    This is clearly not an acceptable distribution.

An exemplary distribution with a flawless trendline can be obtained from Severin Pappadeux's answer, as of Apr 30, 2023. Following the suggestions, I added some parallelization.

uint64_t m = 0xd1342543de82ef95ULL; // taken from https://arxiv.org/pdf/2001.05304.pdf
uint64_t c = 0x1ULL;
inline auto lcg(uint64_t xi) -> uint64_t { // as LCG as it gets
    return m*xi + c;
}
inline auto cmp_lcg(uint64_t a, uint64_t b) -> bool {
    return lcg(a) < lcg(b);
}
auto distributeIndices(uint64_t from, uint64_t to) -> std::vector<uint64_t> {
    uint64_t size = to - from + 1;
    std::vector<uint64_t> z(size);
    tbb::parallel_for(uint64_t(0), size, [&](uint64_t i) {
        z[i] = from + i;
    }); // instead of std::iota(z.begin(), z.end(), from);
    tbb::parallel_sort(z.begin(), z.end(), cmp_lcg); // instead of std::sort(z.begin(), z.end(), cmp_lcg);
    return z;
}

To give an idea of performance boost via multithreading, computing 100 million indices on my i7-3610QM took 15925.91 ms sequentially and 3666.21 ms with parallelization (on 8 hardware threads).
On a computing cluster with Intel Xeon Platinum 8160 processors, I measured the (#cpu,duration[ms]) results (1,19174.65), (2,9862.29), (4,5580.47), (8,3402.05), (12,2119.28), (24,1606.78), and (48,1330.20).

It should also be noted that the code is better optimized and runs much faster when cmp_lcg is turned into a lambda function, e.g. auto cmp_lcg = [](uint64_t a, uint64_t b) -> bool { return lcg(a) < lcg(b); };. This way, it performed best at 2608.15 ms on my i7-3610QM. Slightly better performance still can be reached by declaring the global variables m and c constexpr, or by making them local or literals, which led to a duration of 2542.14 ms.
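
For reference, here is a sketch of such a variant; the exact code behind the 2542.14 ms measurement is not shown above, so treat this as an illustration only:

auto distributeIndices(uint64_t from, uint64_t to) -> std::vector<uint64_t> {
    // The LCG constants from https://arxiv.org/pdf/2001.05304.pdf, now local constexpr values.
    auto cmp_lcg = [](uint64_t a, uint64_t b) -> bool {
        constexpr uint64_t m = 0xd1342543de82ef95ULL;
        constexpr uint64_t c = 0x1ULL;
        return m * a + c < m * b + c; // lcg(a) < lcg(b)
    };
    uint64_t size = to - from + 1;
    std::vector<uint64_t> z(size);
    tbb::parallel_for(uint64_t(0), size, [&](uint64_t i) { z[i] = from + i; });
    tbb::parallel_sort(z.begin(), z.end(), cmp_lcg);
    return z;
}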

  • average course of distributeIndices(42, 42+99999999) with polynomial trendline:
    stackoverflow.com/a/76082801 averages
    stackoverflow.com/a/76082801 averages zoomed
    But when looking at the actual distribution, it becomes apparent that it is not random, but ordered:
  • front ; distributeIndices(42, 42+99999999)[0, ..., 1999] with polynomial trendline:
    stackoverflow.com/a/76082801 front
    The solution is incorrect with regard to the requirement that the distribution should appear random, but due to its mostly nice distributions it is particularly useful for the previously mentioned MPI use case. So if no fully correct solution appears, it can still become the accepted answer, as long as no plausible ranges are found for which it yields non-uniform distributions. Plausible here means that ranges for which the algorithm would run for at least several days can be ignored.
    The factor 0xd1342543de82ef95 shows some weaknesses in the spectral test, and I haven't yet found a reason not to use 0x9e3779b97f4a7c15 instead.

After all this, it should be clear what the task requires: combining

  1. perceived randomness of a
  2. uniform bijective distribution of
  3. plausible arbitrary intervals, with
  4. high performance.

I am very curious if there even exist any correct and well-performing solutions to the problem!
A negative answer to this question, with a corresponding proof, would of course also be acceptable.

Even better than distributeIndices(uint64_t, uint64_t) -> vector<uint64_t> would be an approach that does not have to create a vector at all, but merely iterates the indices in pseudo-random order. However, that would require each pseudo-random index to be efficiently computable from its actual index (without iterating over all the elements before it). I would be surprised if that is possible, but I'd gladly be surprised. Such approaches are always considered better than the vector-constructing ones and are compared amongst each other by the duration of iterating 100 million indices.


Solution

  • Overview

    The following solution constructs a bijective function F that maps a range of integers onto itself. This function can be used to compute a pseudo-random index directly from a source index such that the resulting pseudo-random indices are a permutation of the source indices.

    There are three ideas (all borrowed from cryptography) that taken together allow for the construction of such a function: 1) a Pseudo-Random Function Family (PRF), 2) a Feistel network, and 3) format-preserving encryption (FPE). While these ideas draw on well-studied cryptography concepts, I believe the end product is probably unique and should definitely be considered insecure.

    The basic strategy is to encrypt the source index to produce the target index. The secret sauce is to design the encryption to be bijective and use a range of integers as the domain. I have dubbed this method feisty for the use of a Feistel network.

    Pseudo Random Function Family

    The first step in the construction is creating a PRF that returns a pseudo-random value given a 64-bit input. We can create this family using a single function that also takes a subkey parameter selecting which member of the family to use. The canonical PRF example uses AES to produce a 128-bit pseudo-random value. We will use the following function, which is more efficient to evaluate (although much less secure) and produces a 64-bit pseudo-random value. The s0 parameter is the source index and the s1 parameter is the subkey.

    uint64_t pseudo_random_function(uint64_t s0, uint64_t s1) {
        auto a = s0 + s1;                 // mix the source index with the subkey
        a ^= a >> 12;                     // xorshift-style bit scrambling
        a ^= a << 25;
        a ^= a >> 27;
        return a * 0x2545f4914f6cdd1dull; // final multiply (the xorshift64* multiplier)
    }
    

    This function can be used directly to order the source indices, producing a pseudo-random permutation as in Severin Pappadeux's answer, which is equivalent to constructing an FPE using a prefix cipher. The main difference is that this PRF produces more "random"-looking results than the linear congruential generator, as shown in the following plot.

    Iterated PRF
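
    The driver further below sorts by an iterate_prf helper for exactly this kind of ordering; that helper is not included in this excerpt. A plausible sketch that simply chains the PRF with a fixed subkey (both the chaining scheme and the subkey constant are assumptions) would be:

    uint64_t iterate_prf(uint64_t value, int iterations) {
        for (int i = 0; i < iterations; ++i)
            value = pseudo_random_function(value, 0x9e3779b97f4a7c15ull); // fixed subkey, chosen arbitrarily
        return value;
    }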

    Feistel Network

    Instead of using the PRF directly, we will apply a Feistel network that leverages the PRF as its round function. The two key advantages of the Feistel network are that 1) the operation is guaranteed to be invertible (i.e. bijective) even if the round function is not, and 2) the number of output bits can be chosen to be at most one or two more than the number of input bits, making the encoding range of the network at most four times larger than the input domain. The suggested minimum number of rounds for security applications is three. The following class implements a balanced Feistel network.

    template<class PRF>
    class FeistelNetwork {
    public:
        FeistelNetwork(int number_of_bits, int number_rounds, PRF&& prf)
            : shift_((1 + number_of_bits) / 2)
            , mask_((uint64_t{1} << shift_) - 1)
            , nrounds_(number_rounds)
            , prf_(std::forward<PRF>(prf)) {
        }
    
        auto encode(uint64_t msg) const {
            auto [left, right] = split(msg);
            for (auto i = 0; i < nrounds_; ++i)
                round(left, right, Rounds[i]);
            return combine(left, right);
        }
    
        auto decode(uint64_t msg) const {
            auto [left, right] = split(msg);
            for (int i = nrounds_ - 1; i >= 0; --i)
                round(right, left, Rounds[i]);
            return combine(left, right);
        }
    
    private:
        std::tuple<uint64_t, uint64_t> split(uint64_t msg) const {
            auto right = msg bitand mask_;
            auto left = (msg >> shift_) bitand mask_;
            return std::make_tuple(left, right);
        }
    
        uint64_t combine(uint64_t left, uint64_t right) const {
            return (left << shift_) bitor right;
        }
    
        void round(uint64_t& left, uint64_t& right, uint64_t constant) const {
            auto prf_value = prf_(right, constant) bitand mask_;
            auto r = left ^ prf_value;
            left = right;
            right = r;
        }
    
        static constexpr uint64_t Rounds[] = {
            0x88ef7267b3f978daull,
            0x5457c7476ab3e57full,
            0x89529ec3c1eec593ull,
            0x3fac1e6e30cad1b6ull,
            0x56c644080098fc55ull,
            0x70f2b329323dbf62ull,
            0x08ee98c0d05e3dadull,
            0x3eb3d6236f23e7b7ull,
            0x47d2e1bf72264fa0ull,
            0x1fb274465e56ba20ull,
            0x077de40941c93774ull,
            0x857961a8a772650dull
        };
    
        int shift_;
        uint64_t mask_;
        int nrounds_;
        PRF prf_;
    };
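
    As a quick usage sketch (assuming the class and the pseudo_random_function above), encode and decode are inverses for any value that fits in the configured width:

    #include <cassert>

    void roundtrip_check() {
        FeistelNetwork net(17, 3, &pseudo_random_function); // 17-bit domain, 3 rounds
        for (uint64_t msg = 0; msg < 100000; ++msg)
            assert(net.decode(net.encode(msg)) == msg);     // decode inverts encode for every input
    }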
    

    Cycle-walk and Feisty Permutation

    If the source index range happens to be an even power of two, then we can simply call encode on the Feistel network to map a source index to a pseudo-random target index. In general, however, the Feistel network may return an encoding that is outside the source index domain. The solution is to simply call encode again on the out-of-range index until we get an index that is in the source index domain. This recursion will terminate because the Feistel network encryption is bijective and the domain is finite. For the worst-case source index range (i.e. one more than an even power of two), there will be an average of almost four calls to encode for a balanced network, or two for an unbalanced network. The following class implements the basic logic along with mapping the source index domain from [min, max] to [0, max-min].
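
    The PseudoRandomPermutation class itself is not included in this excerpt. A minimal sketch consistent with how it is used in the driver below (a constructor taking min, max, the number of rounds and the PRF, plus min(), max() and encode() members) could look like this:

    template<class PRF>
    class PseudoRandomPermutation {
    public:
        PseudoRandomPermutation(uint64_t min, uint64_t max, int rounds, PRF&& prf)
            : min_(min)
            , max_(max)
            , network_(bits_for(max - min), rounds, std::forward<PRF>(prf)) {
        }

        uint64_t min() const { return min_; }
        uint64_t max() const { return max_; }

        // Map a source index in [min, max] to a pseudo-random index in [min, max].
        uint64_t encode(uint64_t index) const {
            auto code = network_.encode(index - min_);
            while (code > max_ - min_)          // cycle-walk until the code lands in range
                code = network_.encode(code);
            return code + min_;
        }

    private:
        static int bits_for(uint64_t size) {    // smallest bit count whose range covers 0..size
            int bits = 1;
            while (bits < 64 and (uint64_t{1} << bits) <= size)
                ++bits;
            return bits;
        }

        uint64_t min_;
        uint64_t max_;
        FeistelNetwork<PRF> network_;
    };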

    Results

    All of the code and images can be found on GitHub in the 76076957 directory. I used the following driver for testing and for generating the performance metrics, all of which use three rounds in the Feistel network. I wrote the code for clarity and have not yet done any performance work, although I think the inner loops are already pretty efficient.

    #include "core/util/tool.h"
    #include "core/chrono/stopwatch.h"
    #include "core/string/lexical_cast_stl.h"
    
    template<class Work>
    void measure(std::ostream& os, std::string_view desc, Work&& work) {
        chron::StopWatch timer;
        timer.mark();
        if (work())
            os << fmt::format("{:>12s}: work failed", desc) << endl;
        auto millis = timer.elapsed_duration<std::chrono::milliseconds>().count();
        os << fmt::format("{:>12s}: {:5d} ms", desc, millis) << endl;
    }
    
    int tool_main(int argc, const char *argv[]) {
        ArgParse opts
            (
             argValue<'m'>("range", std::make_pair(0, 16), "Permutation range min:max"),
             argValue<'r'>("rounds", 3, "Number of rounds"),
             argFlag<'p'>("performance", "Measure performance"),
             argFlag<'s'>("sort", "Sort index based on PRF")
             );
        opts.parse(argc, argv);
        auto [min, max] = opts.get<'m'>();
        auto rounds = opts.get<'r'>();
        auto measure_performance = opts.get<'p'>();
        auto sort_index = opts.get<'s'>();
    
        if (measure_performance) {
            PseudoRandomPermutation perm(min, max, rounds, &pseudo_random_function);
            measure(cout, "Permutation", [&]() {
                for (auto i = perm.min(); i < perm.max(); ++i) {
                    auto code = perm.encode(i);
                    if (code < perm.min() or code > perm.max())
                        return true;
                }
                return false;
            });
        } else if (sort_index) {
            std::vector<uint64_t> codes;
            for (auto i = min; i < max; ++i)
                codes.push_back(i);
            std::sort(codes.begin(), codes.end(), [](uint64_t a, uint64_t b) {
                return iterate_prf(a, 3) < iterate_prf(b, 3);
            });
            for (auto elem : codes)
                cout << elem << endl;
        } else {
            std::set<uint64_t> codes;
            PseudoRandomPermutation perm(min, max, rounds, &pseudo_random_function);
            for (auto i = min; i < max; ++i) {
                auto code = perm.encode(i);
                assert(code >= min and code <= max);
                codes.insert(code);
                cout << i << " " << code << endl;
            }
            assert(codes.size() == max - min);
        }
    
        return 0;
    }
    

    I have not done any statistical tests, but have simply eyeballed the plots and based on the eyeball tests I believe this answer satisfies the criteria:

    1. It looks random (cannot differentiate from std::shuffle).
    2. It looks uniformly distributed (straight line cumulative sum).
    3. It is efficient (tens of nanoseconds / index).
    4. Computable directly from a single actual index.

    Basic Performance Measurements

    39ns / index on Mac M1 Pro (arm64, MacOSX)
    52ns / index on Intel Xeon ES-2698 @ 2.2Ghz (x86, Ubuntu 20.04)
    

    Randomness

    The following two plots compare using std::shuffle versus feisty for creating the pseudo-random permutation for 20k indices. The third plot shows the cumulative sum of the pseudo-random indices, which should be a straight line for a uniform distribution.

    [plots: Shuffle, 20k ; Feisty, 3 rounds, 20k ; Feisty, cumulative sum]

    Round Behavior

    Just for curiosity's sake, here are plots for using from 1 to 5 rounds of the Feistel network. As suggested by theory, there need to be at least three rounds to achieve good results.

    [plots: Feisty with 1, 2, 3, 4, and 5 rounds, 20k indices each]