
R sample function issue over 10 million values


I found this quirk in R and can't find much evidence for why it occurs. I was trying to recreate a sample as a check and discovered that the sample function behaves differently in certain cases. See this example:

# Look at the first ten rows of a randomly ordered vector of the first 10 million integers
set.seed(4)
head(sample(1:10000000), 10)
[1] 5858004   89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395  731445

# Select a specified sample of size 10 from this same list
set.seed(4)
sample(1:10000000, size = 10)
[1] 5858004   89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395  731445


# Try the same with a vector of the first 10,000,001 integers
set.seed(4)
head(sample(1:10000001), 10)
[1] 5858004   89458 2937396 2773750 8135740 2604277 7244056 9060917 9490396  731445

set.seed(4)
sample(1:10000001, size = 10)
[1] 5858004   89458 2937397 2773750 8135743 2604278 7244060 9060923 9490404  731445

I tested many values up to this 10 million threshold and found that the values matched (though I admit to not testing more than 10 output rows).

Anyone know what's going on here? Is there something significant about this 10 million number?


Solution

  • Yes, there's something special about 1e7. If you look at the sample code, it ends up calling sample.int. And as you can see at ?sample, the default value for the useHash argument of sample.int is

    useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7)
    

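    That && n > 1e7 means that once n exceeds 1e7, the default switches to useHash = TRUE, provided the other conditions also hold. In your example, sample(1:10000001, size = 10) satisfies size <= n/2 and so uses the hashing algorithm, whereas sample(1:10000001) draws all n values, fails that condition, and uses the non-hashing algorithm. That is why the two results diverge above the threshold. If you want consistency, call sample.int directly and specify the useHash value. (TRUE is a good choice for memory efficiency; see the argument description at ?sample for details.)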
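
    As a minimal sketch of that advice, the snippet below fixes useHash explicitly for a value of n just above the threshold. The variable names (n, full, forced, hashed) are only illustrative. Based on the behaviour shown in the question for n = 1e7, the permutation prefix and the useHash = FALSE draw should agree, while the default draw is free to differ.

    ```r
    n <- 1e7 + 1  # just above the 1e7 threshold

    # Full permutation: size equals n, so size <= n/2 is FALSE and
    # the default resolves to useHash = FALSE
    set.seed(4)
    full <- head(sample.int(n), 10)

    # Small sample with hashing switched off explicitly
    set.seed(4)
    forced <- sample.int(n, size = 10, useHash = FALSE)

    # Small sample with the default, which resolves to useHash = TRUE here
    set.seed(4)
    hashed <- sample.int(n, size = 10)

    # Compare the explicit non-hashing draw with the permutation prefix
    print(identical(full, forced))
    print(hashed)
    ```

    Passing useHash = FALSE makes R allocate a working vector of length n even for a small sample, so this buys reproducibility against the full permutation at the cost of memory. Passing useHash = TRUE everywhere you take small samples is the cheaper way to stay consistent.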