I wrote this little C++ function that works like R's sample.int(..., replace =FALSE)
function. Essentially it draws from uniformly distributed integers and writes the results into a set until the set is of size size
.
Maybe I'm missing something here, but I find the following behaviour quite strange. Here's a reprex:
#reprex.cpp
#include <Rcpp.h>
#include <random>
#include <set>
// [[Rcpp::export]]
std::set<unsigned long long int> sample_int(
unsigned long long int N,
unsigned long long int size)
{
std::mt19937 rng(std::random_device{}());
// Create an empty set of integers.
std::set<unsigned long long int> set;
while (set.size() < size)
{
unsigned long long int value = std::uniform_int_distribution<int>(1, N)(rng);
set.insert(value);
}
return set;
}
/*** R
very_big_n <- 15^16
less_big_n <- 16^15
less_big_n < very_big_n
sample_int(15^16, 10)
sample_int(16^15, 10)
*/
Executing this using Rcpp
yields:
[R] Rcpp::sourceCpp("reprex.cpp")
[R] very_big_n <- 15^16
[R] less_big_n <- 16^15
[R] less_big_n < very_big_n
[1] TRUE
[R] sample_int(very_big_n, 10)
[1] 114533684 182757292 493592758 712746739 751345901 804523992 867187282
[8] 905509919 929228169 929784901
[R] sample_int(less_big_n, 10)
Error: segfault from C stack overflow
Am I missing something here? Why do I get that segfault when calling sample_int
with a smaller input but not with that very large one?
I'm not going to judge whether or not your code is effective, optimized or in general safe.
I will however answer your question, the answer lies within this line of code (the error is enclosed in double asterix):
unsigned long long int value = std::uniform_int_distribution**<int>**(1, N)(rng);
By changing the template type to unsigned long long i.e.:
unsigned long long int value = std::uniform_int_distribution<unsigned long long>(1, N)(rng);
You fix your stack overflow. And your function should now work with "very large" numbers. The fact that it didn't happen with "very big n" is just a coincidence.
The stack overflow happens within this function - one of the interval checks for the formula that generates the random number fails. That is because the upper limit is the one that overflows, i.e. after replicating the same error you experienced and going through the stacktrace you will get a more meaningful error message, something like this:
/usr/include/c++/12.2.0/bits/uniform_int_dist.h:97:
std::uniform_int_distribution<_IntType>::param_type::param_type(_IntType, _IntType)
[with _IntType = int]: Assertion '_M_a <= _M_b' failed.
Hope it helps!
EDIT: As Dirk Eddelbuettel mentioned in the comments using unsigned long long is an archaic from old times. Even though that in the STL documentation they state that std::uniform_int_distribution might have undefined behavior when not using any of the suggested template types, uint64_t should still work fine (I've skimmed through the implementation). The added benefit is that uint64_t is consistent across different architectures. To use the uint64_t integer type you just need to include this header:
#include <cstdint>