Tags: python, python-3.x, random, sample

Justification of constants used in random.sample


I'm looking into the source code for the function sample in random.py (the Python standard library).

The idea is simple:

  • If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
  • If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.

My Question

There are a few constants involved: setsize = 21 and setsize += 4 ** _ceil(_log(k * 3, 4)). The critical comparison is therefore roughly n : 21 + 3*k. The comments say # size of a small set minus size of an empty list and # table size for big sets.

  • Where do these specific numbers come from? What is their justification?

The comments shed some light; however, I find they raise as many questions as they answer.

  • I can sort of understand "size of a small set", but I find the "minus size of an empty list" part confusing. Can someone shed any light on this?
  • What is meant specifically by "table size", as opposed to, say, "set size"?

Looking at the GitHub repository, it looks like a very old version simply used n : 6*k as the critical comparison, but I find that equally mysterious.
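
To get a feel for where that crossover actually lands, here is a small sketch of my own (not part of random.py; it uses math.ceil and math.log in place of the module-private _ceil and _log) that simply re-evaluates the setsize expression for a few values of k:

from math import ceil, log

def estimated_setsize(k):
    # Re-create the heuristic from random.sample: a base of 21
    # pointer-sized slots, plus the hash-table term for k > 5.
    setsize = 21
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))
    return setsize

# sample() takes the "pool" (list copy) branch whenever n <= setsize.
for k in (1, 5, 6, 20, 100, 1000):
    print(k, estimated_setsize(k))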

The code

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result

(I apologise if this question would be better placed on math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though it was maybe something to do with the amount of space that sets and lists use - but I couldn't find any details anywhere.)


Solution

  • This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).

    It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:

    >>> import sys
    >>> sys.getsizeof(set()) - sys.getsizeof([])
    160
    

    and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.


    I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:

    setentry smalltable[PySet_MINSIZE];
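
    One hedged way to see this on whatever build you have is to watch sys.getsizeof while a set grows: the reported size stays flat while the inline small table is in use and jumps once an external table is allocated (the exact threshold and byte counts depend on the CPython version and platform, so treat this as a sketch, not a specification):

    import sys

    # Watch for the first jump in size: that is roughly where the set stops
    # using the inline smalltable and allocates an external hash table.  The
    # element count at which this happens varies across CPython versions.
    prev = None
    for n in range(12):
        size = sys.getsizeof(set(range(n)))
        note = "  <- grew" if prev is not None and size > prev else ""
        print(f"{n:2d} elements: {size} bytes{note}")
        prev = size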
    

    The

    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    

    check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
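
    As a rough, build-dependent sanity check, you can compare what that term evaluates to against the measured growth of a real set, expressed here in 8-byte pointer units on a 64-bit build (the two need not agree, which is exactly the uncertainty mentioned above):

    import sys
    from math import ceil, log

    POINTER = 8  # assumes a 64-bit build

    for k in (6, 20, 100, 1000):
        estimated = 4 ** ceil(log(k * 3, 4))  # the term sample() adds for k > 5
        measured = (sys.getsizeof(set(range(k))) - sys.getsizeof(set())) // POINTER
        print(k, estimated, measured)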

    Finally,

    if n <= setsize:
    

    compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)
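
    The list side can be eyeballed the same way (again assuming 8-byte pointers): getsizeof counts the list header plus one pointer per allocated slot, so any overallocation shows up as extra slots beyond n:

    import sys

    POINTER = 8  # assumes a 64-bit build

    for n in (10, 100, 1000):
        pool = list(range(n))
        slots = (sys.getsizeof(pool) - sys.getsizeof([])) // POINTER
        print(n, slots)  # slots > n indicates overallocation on this build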