Tags: python, probability, python-hypothesis

Should probabilities be managed outside of Hypothesis?


import hypothesis
from hypothesis import strategies as st
import random
import time
import math

Imagine your code expects a list whose elements are sampled from four possible values. Some elements are more likely than others. You don't know the list's length, so you've modeled the list as a series of Bernoulli trials, where the first failure ends the list.

List elements and their weights (out of 100):

  • "a": 10
  • "b": 20
  • "c": 25
  • "d": 35

Failure weight: 10, i.e. a failure probability of 0.1, which gives an expected list length of (1 - 0.1) / 0.1 = 9 elements.
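
As a quick sanity check (an illustrative snippet, separate from the strategy below), simulating this process with the plain random module gives an average length close to 9:

def simulate_once():
    length = 0
    while random.random() >= 0.1:  # a failure (probability 0.1) ends the list
        length += 1
    return length

print(sum(simulate_once() for _ in range(100_000)) / 100_000)  # ~9.0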

Here's the strategy:

@st.composite
def choices_bernoulli(draw, population, weights, failure_weight):
    """
    Simulate a Bernoulli process that stops after the first failure,
    with failure probability given by 'failure_weight'.
    """
    rnd = draw(st.randoms())  # a random.Random instance managed by Hypothesis
    fail = object()           # unique sentinel marking the end of the list
    results = []
    while True:
        choice = rnd.choices((*population, fail),
                             weights=(*weights, failure_weight), k=1)[0]
        if choice is fail:
            break
        results.append(choice)
    return results

choices_bernoulli("abcd", [10,20,25,35], 10)

For me, this takes a depressing ~35 seconds to generate 100 samples. But even before worrying about that, note that there are actually two random processes here. One samples an item from a pool of known elements and returns a known element. The other simulates a random process to generate the list's length, a new unknown quantity. Should either of these processes be specified outside of Hypothesis?

One of Hypothesis's jobs is to sample the search space for test data. A perfect fuzzer would ideally consider every possible combination of choices, and from that perspective there's no reason to specify choice probabilities: every combination will be sampled anyway. Unfortunately, the search space is almost always too large. If every combination can't be sampled, one solution is to sample the search space uniformly, which is how shrinking works: in strategies like one_of, you always order smaller sub-spaces first because that's what Hypothesis will shrink towards. Uniform sampling attempts to exercise all code paths regardless of bias in the likely real-world data. Another solution is to push Hypothesis towards exploring more fully the part of the sample space that produces the more likely choices; that's what specifying choice probabilities does. Unfortunately, I've come to believe that most bugs lie in unexpected inputs, not in subtle variations of the most common case. So there - I've just convinced myself that specifying choice probabilities is not a very good idea.

Even if it were, Hypothesis manipulates the random generator in a way that invalidates the choice probabilities, though this can be overridden with the use_true_random parameter. Below is a comparison of random.choices with quadratic weights, using both a Hypothesis-controlled ("false") random and a true random, alongside strategies.sampled_from (a sketch of such a comparison harness follows the histograms). Unlike strategies.sampled_from, the Hypothesis-controlled random doesn't seem to be subject to any shrinking; I'm guessing that Hypothesis can only manipulate the sample space or the state of the random number generator, and has no control over the output of arbitrary random functions. As for the strategy above, it generates samples with an average length of 3-4 rather than 9.

random.choices: false random            random.choices: true random             strategies.sampled_from
 0: ***********************              0:                                      0: **************************
 1: *******************                  1:                                      1: **********************
 2: *****************************        2: **                                   2: **********************
 3: ************************             3: ****                                 3: **************************
 4: ********************                 4: *****                                4: ***************************
 5: **************************           5: ********                             5: ************************
 6: *********************                6: *************                        6: **************************
 7: **********************               7: *****************                    7: ***************************
 8: *************************            8: **************************           8: *****************************
 9: **********************               9: *****************************        9: *********************
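
For reference, a sketch of the kind of harness that can produce histograms like these (an assumed reconstruction; bucket counts and formatting are illustrative):

from collections import Counter
from hypothesis import given, settings

def tally(draw_bucket, n=100):
    counts = Counter()

    @settings(max_examples=n)
    @given(st.data())
    def run(data):
        counts[draw_bucket(data)] += 1

    run()
    return counts

def show(title, counts):
    print(title)
    for bucket in range(10):
        print(f"{bucket:2d}: " + "*" * counts[bucket])

weights = [i ** 2 for i in range(10)]  # quadratic weights over buckets 0-9

show("random.choices: false random",
     tally(lambda d: d.draw(st.randoms(use_true_random=False)).choices(range(10), weights)[0]))
show("random.choices: true random",
     tally(lambda d: d.draw(st.randoms(use_true_random=True)).choices(range(10), weights)[0]))
show("strategies.sampled_from",
     tally(lambda d: d.draw(st.sampled_from(range(10)))))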

Now that I've decided to avoid probabilities in favor of sampling the search space uniformly, how can I do this for variables that are only described by probabilities, such as the list length? My first attempt was to discard the negative binomial distribution in favor of generating data that shares some of its statistics, such as the mean. How about sampling the list length from a uniform distribution with the same mean as the original negative binomial distribution? Well, it didn't work, and I shouldn't have been surprised: the lengths produced by Hypothesis's list strategy aren't uniformly distributed. The strategy below generates lists with an average length of 4-5 rather than the expected 9.

@st.composite
def choices_bernoulli_mean(draw, population, failure_weight):
    fail_norm = failure_weight / 100
    # Mean of the negative binomial (successes before the first failure).
    expected_k = (1 - fail_norm) / fail_norm
    # A uniform distribution over [0, 2 * mean] has the same mean.
    max_size = int(expected_k * 2)
    return draw(st.lists(st.sampled_from(population),
                         min_size=0, max_size=max_size))

Next I decided to sample the length directly from the negative binomial distribution rather than simulating the random process, and to feed that sampled length into the list strategy as a fixed size. Again we run into the same issue, this time with random.uniform, which under Hypothesis is no longer uniform. The same thing happens when replacing it with a floats strategy. Both have a penchant for extremes, and extremely large lists are very slow to generate - both versions time out with Unsatisfiable.

@st.composite
def choices_bernoulli_dist(draw, population, failure_weight):
    def inverse(i, p):
        # Inverse CDF of the geometric distribution: maps i in (0, p]
        # to the number of successes before the first failure.
        if i <= 0 or i > p:
            raise ValueError
        return math.floor(math.log(i / p) / math.log(1 - p))

    p = failure_weight / 100

    rnd = draw(st.randoms(use_true_random=False))
    i = rnd.uniform(0, p)
    hypothesis.assume(0 < i <= p)

    # An alternative source:
    # i = draw(st.floats(min_value=0, max_value=p,
    #                    exclude_min=True, exclude_max=False))

    k = inverse(i, p)
    return draw(st.lists(st.sampled_from(population), min_size=k, max_size=k))
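
To confirm the transform itself is sound (a standalone verification snippet, outside Hypothesis), feeding the same inverse function genuinely uniform draws gives an average length of about (1 - p) / p = 9, so the skew comes from the source of randomness rather than from the math:

def inverse(i, p):
    # Same inverse as in choices_bernoulli_dist above.
    return math.floor(math.log(i / p) / math.log(1 - p))

p = 0.1
draws = (random.uniform(1e-12, p) for _ in range(100_000))  # genuinely uniform on (0, p]
print(sum(inverse(i, p) for i in draws) / 100_000)  # ~9.0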

From this point we can either sample the negative binomial distribution with a true random number generator, or accept the Hypothesis-controlled random and impose our own minimum and maximum list sizes. If we impose limits, the problem transforms into one of sampling from a pool of known elements, for which we already decided not to impose any external distribution. Even if we don't impose limits, it's now clear that the random process biases the data we can generate in ways that may not be compatible with how Hypothesis searches the sample space. Of course, we don't really want a uniform distribution either: a 99th-percentile sample from the negative binomial would be a list of 43 members, and that's far too long an upper bound for a uniform distribution. So perhaps we only want to consider 75% of possible samples (length 13), or perhaps 50% (length 6) - and once again we are imposing limits. Perhaps we want a Hypothesis-driven distribution, which attempts only a few large values that then shrink down. I honestly have no idea. What are your thoughts?
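
For reference, those percentile cut-offs come straight from the geometric CDF; a small helper like this (purely illustrative) reproduces them:

def length_quantile(q, p):
    # Smallest k such that P(length <= k) >= q, where
    # P(length <= k) = 1 - (1 - p)**(k + 1).
    return math.ceil(math.log(1 - q) / math.log(1 - p)) - 1

print(length_quantile(0.99, 0.1))  # 43
print(length_quantile(0.75, 0.1))  # 13
print(length_quantile(0.50, 0.1))  # 6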

As for performance, the real bottlenecks are the calls to st.sampled_from or random.choices. By comparison, generating the list itself is downright cheap.
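
A rough way to check this (an assumed micro-benchmark; exact timings will vary) is to compare a fixed-size list built from st.sampled_from with one built from a trivial element strategy, isolating the cost of the list machinery itself:

from hypothesis import given, settings

def time_strategy(strategy, n=100):
    @settings(max_examples=n)
    @given(strategy)
    def run(_sample):
        pass

    start = time.perf_counter()
    run()
    return time.perf_counter() - start

# Per-element draws via st.sampled_from versus the list machinery alone.
print(time_strategy(st.lists(st.sampled_from("abcd"), min_size=9, max_size=9)))
print(time_strategy(st.lists(st.just("a"), min_size=9, max_size=9)))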


Solution

  • Should probabilities be managed outside of Hypothesis?

    In short, no: just specify what inputs are possible, and let the library handle their relative probabilities.

    Now that I've decided to avoid probabilities in favor of sampling the search space uniformly, how can I do this for variables that are only described by probabilities, such as the list length?

    As with the elements, I'd recommend leaving this to Hypothesis itself: specify st.lists(elems, min_size=x, max_size=y) with whatever defines the limits of valid inputs for your code, and trust that the engine will work it out.

    We have some fancy private internals for biased coins (etc.), plus a pile of machinery for inducing bugs by systematically skewing distributions in various terrible ways. Ultimately, if you need even more exploratory firepower I'd reach for feedback-guided techniques like HypoFuzz or possibly CrossHair, and eventually hand-write some unit tests to check whatever bizarre edge cases I could think of.