I found this quirk in R and can't find much evidence for why it occurs. I was trying to recreate a sample as a check and discovered that the sample function behaves differently in certain cases. See this example:
# Look at the first ten elements of a randomly ordered vector of the first 10 million integers
set.seed(4)
head(sample(1:10000000), 10)
[1] 5858004 89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395 731445
# Draw a sample of size 10 from the same vector
set.seed(4)
sample(1:10000000, size = 10)
[1] 5858004 89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395 731445
# Try the same with the first 10,000,001 integers
set.seed(4)
head(sample(1:10000001), 10)
[1] 5858004 89458 2937396 2773750 8135740 2604277 7244056 9060917 9490396 731445
set.seed(4)
sample(1:10000001, size = 10)
[1] 5858004 89458 2937397 2773750 8135743 2604278 7244060 9060923 9490404 731445
I tested many values up to this 10 million threshold and found that the two approaches matched (though I admit to not checking more than the first 10 output values).
Anyone know what's going on here? Is there something significant about this 10 million number?
Yes, there's something special about 1e7. If you look at the sample code, it ends up calling sample.int. And as you can see at ?sample, the default value for the useHash argument of sample.int is
useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7)
That && n > 1e7 means that once n exceeds 1e7 (and the other conditions hold), the default switches to useHash = TRUE. Note that the full permutation sample(1:10000001) still gets useHash = FALSE, because there size = n and the size <= n/2 condition fails; that's why its first 10 values diverge from the size = 10 call.
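To see which way the default falls, you can plug the question's values into that expression (a quick sketch; the variables just stand in for the arguments of sample.int):
# Sketch: evaluate the useHash default by hand for the cases above
replace <- FALSE; prob <- NULL; size <- 10
n <- 10000000
!replace && is.null(prob) && size <= n/2 && n > 1e7  # FALSE: n is not above 1e7
n <- 10000001
!replace && is.null(prob) && size <= n/2 && n > 1e7  # TRUE: hashing kicks in
size <- n  # the full-permutation case, sample(1:10000001)
!replace && is.null(prob) && size <= n/2 && n > 1e7  # FALSE: size <= n/2 fails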
If you want consistency, call sample.int directly and specify the useHash value. (TRUE is a good choice for memory efficiency; see the argument description at ?sample for details.)
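For example, forcing useHash explicitly should reproduce either of the question's results (a sketch, not re-run here):
set.seed(4)
sample.int(10000001, size = 10, useHash = FALSE)
# should match head(sample(1:10000001), 10), which defaults to useHash = FALSE
set.seed(4)
sample.int(10000001, size = 10, useHash = TRUE)
# should match sample(1:10000001, size = 10), which defaults to useHash = TRUE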