I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
My Question
There are a few constants involved, setsize = 21
and setsize += 4 ** _log(3*k,4)
. The critical ratio is roughly k : 21+3k. The comment says # size of a small set minus size of an empty list
and # table size for big sets
.
The comments shed some light, however I find they bring as many questions as they answer.
Looking on the github repository, it looks like a very old version simply used the ratio k : 6*k, as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence or set.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use range as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
if isinstance(population, _Set):
population = tuple(population)
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
randbelow = self._randbelow
n = len(population)
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize:
# An n-length list is smaller than a k-length set
pool = list(population)
for i in range(k): # invariant: non-selected at [0,n-i)
j = randbelow(n-i)
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow(n)
while j in selected:
j = randbelow(n)
selected_add(j)
result[i] = population[j]
return result
(I apologise is this question would be better placed in math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though, it was maybe something to do with the amount of space that sets and lists use - but could't find any details anywhere).
This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21
was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n
pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population)
, so it may be underestimating the cost of the list.)