Tags: unit-testing, testing, tdd

TDD for an algorithm involving randomness


I would like to try out test-driven development, but the project I am working on involves a lot of randomness and I am very unsure about how I can test it. Here is a toy example of the kind of algorithm I may want to write:

Write a function taking no argument and returning a list of random integers satisfying the following properties

  • Each integer is between 0 and 10
  • The same number doesn’t appear twice
  • The list is of length 3 90% of the time, and of length 4 10% of the time
  • There is a 50% chance for the number 3 to appear

I do not need to test the exact statistical distribution, but obviously I would like tests that will fail if someone completely removes the corresponding code.

I am using an external RNG that you can assume is correct, and I am pretty free in how to structure the code, so I can use dependency injection to have the tests use a fake RNG instead - but I still don't really see how that would help. For instance, even if I always use the same seed for the tests, as soon as I refactor the algorithm to pick random numbers in a different order, all the tests become meaningless.

I guess that the first two points could be tested by generating many cases and checking that the constraints are satisfied, but that doesn’t really feel like TDD.

For the last two points, I’m thinking of having tests with different configurations, where for instance the 90% is either 100% or 0%, and then I can test if the length of the list is indeed 3 or 4. I guess it would work, but it seems maybe a bit weak.
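
To make that idea concrete, here is a rough sketch of what I have in mind (the function name and the exact shape are made up for illustration; any language would do, Haskell is just used for consistency with the rest of this page):

    import System.Random.Stateful (globalStdGen, uniformRM)

    -- The probability of producing a length-4 list is a parameter, so a test can
    -- pass 0 or 100 and assert the exact length, while production code passes 10.
    rintsWith :: Int -> IO [Int]
    rintsWith lengthFourPercent = do
      p <- uniformRM (1 :: Int, 100) globalStdGen
      return $ if p <= lengthFourPercent then [1, 2, 3, 4] else [1, 2, 3]

    -- In a test: rintsWith 0 must always return a list of length 3,
    -- and rintsWith 100 must always return a list of length 4.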

Are there any guidelines or other techniques for applying TDD to algorithms involving randomness?


Solution

  • There are several ways you can go about a problem like this, and I may add another answer in the future, but the approach that I immediately found most compelling would be to combine test-driven development (TDD) with property-based testing.

    You can do this in many languages, with various frameworks. Here, I'm going to use the original property-based testing library, QuickCheck.

    The first two requirements translate directly to predicates that QuickCheck can exercise. The latter two translate into distribution tests - a more advanced feature of QuickCheck that John Hughes explains in this presentation.

    Each one in turn.

    Preliminaries

    Before writing the first test, you're going to set up tests and import the appropriate libraries:

    module RintsProperties where

    import Data.List (nub)
    import Test.Framework (Test)
    import Test.Framework.Providers.QuickCheck2
    import Test.QuickCheck
    import Q72168364
    

    where the System Under Test (SUT) is defined in the Q72168364 library. The SUT itself is an action called rints (for Random INTS):

    rints :: IO [Int]
    

    Since it's going to generate random numbers, it'll have to run in IO.

    Image

    The first requirement says something about the image of the SUT. This is easily expressed as a property:

    testProperty "Each integer is between 0 and 10" $ \() -> ioProperty $ do
      actual <- rints
      return $
        counterexample ("actual: " ++ show actual) $
        all (\i -> 0 <= i && i <= 10) actual
    

    If you ignore some of the ceremony involved with producing a useful assertion message and such, the central assertion is this:

        all (\i -> 0 <= i && i <= 10) actual
    

    which verifies that all integers i in actual are between 0 and 10.

    In true TDD fashion, the simplest implementation that passes the test is this:

    rints :: IO [Int]
    rints = return []
    

    Always return an empty list. While degenerate, it fulfils the requirement.

    No duplicates

    The next requirement also translates easily to a predicate:

    testProperty "The same number does not appear twice" $ \() -> ioProperty $ do
      actual <- rints
      return $ nub actual === actual
    

    nub removes duplicates, so this assertion states that nub actual (actual where duplicates are removed) should be equal to actual. This is only going to be the case if there are no duplicates in actual.

    In TDD fashion, the implementation unfortunately doesn't change:

    rints :: IO [Int]
    rints = return []
    

    In fact, when I wrote this property, it passed right away. If you follow the red-green-refactor checklist, this isn't allowed. You should start each cycle by writing a red test, but this one was immediately green.

    The proper reaction should be to discard (or stash) that test and instead write another one - perhaps taking a cue from the Transformation Priority Premise to pick a good next test.

    For instructional reasons, however, I will stick with the order of requirements as they are stated in the OP. Instead of following the red-green-refactor checklist, I modified rints in various ways to assure myself that the assertion works as intended.
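
    For example, a deliberately broken variant like the following (purely for illustration; it is not part of the final code) makes the "no duplicates" property fail, confirming that the assertion has teeth:

    rints :: IO [Int]
    rints = return [1, 1, 2] -- 1 appears twice, so nub actual === actual fails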

    Length distribution

    The next requirement isn't a simple predicate, but rather a statement about the distribution of outcomes. QuickCheck's cover function enables that - a feature that I haven't seen in other property-based testing libraries:

    testProperty "Length is and distribution is correct" $ \() -> ioProperty $ do
      actual <- rints
      let l = length actual
      return $
        checkCoverage $
        cover 90 (l == 3) "Length 3" $
        cover 10 (l == 4) "Length 4"
        True -- Base property, but really, the distribution is the test
    

    The way that cover works, it needs to have a 'base property', but here I simply return True - the base property always passes, meaning that the distribution is the actual test. checkCoverage makes QuickCheck keep generating test cases until it can decide, with statistical confidence, whether the stated coverage requirements hold; without it, insufficient coverage would only produce a warning.

    The two instances of cover state the percentage with which each predicate (l == 3 and l == 4) should appear.

    Running the tests with the degenerate implementation produces this test failure:

      Length is and distribution is correct: [Failed]
    *** Failed! Insufficient coverage (after 100 tests):
    Only 0% Length 3, but expected 90%
    

    As the message states, it expected 90% of Length 3 cases, but got 0%.

    Again, following TDD, one can attempt to address the immediate error:

    rints :: IO [Int]
    rints = return [1,2,3]
    

    This, however, now produces this test failure:

      Length is and distribution is correct: [Failed]
    *** Failed! Insufficient coverage (after 400 tests):
    100.0% Length 3
    
    Only 0.0% Length 4, but expected 10.0%
    

    The property expects 10% Length 4 cases, but got 0%.

    Perhaps the following is the simplest thing that could possibly work?

    import System.Random.Stateful
    
    rints :: IO [Int]
    rints = do
      p <- uniformRM (1 :: Int, 100) globalStdGen
      if 10 < p then return [1,2,3] else return [1,2,3,4]
    

    Perhaps not quite as random as you'd expect, but it passes all tests.

    More threes

    The final (explicit) requirement is that 3 should appear 50% of the times. That's another distribution property:

    testProperty "3 appears 50% of the times" $ \() -> ioProperty $ do
      actual <- rints
      return $
        checkCoverage $
        cover 50 (3 `elem` actual) "3 present" $
        cover 50 (3 `notElem` actual) "3 absent"
        True -- Base property, but really, the distribution is the test
    

    Running all tests produces this test failure:

      3 appears 50% of the times: [Failed]
    *** Failed! Insufficient coverage (after 100 tests):
    100% 3 present
    
    Only 0% 3 absent, but expected 50%
    

    Not surprisingly, it says that the 3 present case happens 100% of the time.

    In the spirit of TDD (perhaps a little undisciplined, but it illustrates what's going on), you may attempt to modify rints like this:

    rints :: IO [Int]
    rints = do
      p <- uniformRM (1 :: Int, 100) globalStdGen
      if 10 < p then return [1,2,3] else return [1,2,4,5]
    

    This, however, doesn't work because the distribution is still wrong:

      3 appears 50% of the times: [Failed]
    *** Failed! Insufficient coverage (after 100 tests):
    89% 3 present
    11% 3 absent
    
    Only 11% 3 absent, but expected 50%
    

    Perhaps the following is the simplest thing that works. It's what I went with, at least:

    rints :: IO [Int]
    rints = do
      p <- uniformRM (1 :: Int, 100) globalStdGen
      includeThree <- uniformM globalStdGen
      if 10 < p
        then if includeThree then return [1,2,3] else return [1,2,4]
        else if includeThree then return [1,2,3,4] else return [1,2,4,5]
    

    Not elegant, and it still doesn't produce random numbers, but it passes all tests.

    Random numbers

    While the above covers all explicitly stated requirements, it's clearly unsatisfactory, as it doesn't really produce random numbers between 0 and 10.

    This is typical of the TDD process. As you write tests and the SUT and let the two interact, you discover that more tests are required than you originally thought.

    To be honest, I wasn't sure what the best approach would be to 'force' generation of all numbers between 0 and 10. Now that I had the hammer of distribution tests, I wrote the following:

    testProperty "All numbers are represented" $ \() -> ioProperty $ do
      actual <- rints
      return $
        checkCoverage $
        cover 5 ( 0 `elem` actual) " 0 present" $
        cover 5 ( 1 `elem` actual) " 1 present" $
        cover 5 ( 2 `elem` actual) " 2 present" $
        cover 5 ( 3 `elem` actual) " 3 present" $
        cover 5 ( 4 `elem` actual) " 4 present" $
        cover 5 ( 5 `elem` actual) " 5 present" $
        cover 5 ( 6 `elem` actual) " 6 present" $
        cover 5 ( 7 `elem` actual) " 7 present" $
        cover 5 ( 8 `elem` actual) " 8 present" $
        cover 5 ( 9 `elem` actual) " 9 present" $
        cover 5 (10 `elem` actual) "10 present"
        True -- Base property, but really, the distribution is the test
    

    I admit that I'm not entirely happy with this, as it doesn't seem to 'scale' to problems where the function image is much larger. I'm open to better alternatives.

    I also didn't want to be too specific about the exact distribution of each number. After all, 3 is going to appear more frequently than the others. For that reason, I just picked a small percentage (5%) to indicate that each number should appear not too rarely.
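
    One alternative that comes to mind (I didn't use it above; it assumes a QuickCheck version that ships tabulate and coverTable, i.e. 2.12 or later) would be to classify every observed number into a table and state the per-value requirement in one place:

    testProperty "All numbers are represented (table variant)" $ \() -> ioProperty $ do
      actual <- rints
      return $
        checkCoverage $
        coverTable "Numbers" [(show i, 5) | i <- [0 .. 10 :: Int]] $
        tabulate "Numbers" (map show actual)
        True -- Base property, as before; the table coverage is the test

    Note that the percentages are then calculated over all tabulated values rather than per test case, and since the table rows come from the data itself, the property doesn't grow with the size of the image the way the eleven cover lines do.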

    The implementation of rints so far failed this new test in the same manner as the other distribution tests.

    Crudely, I changed the implementation to this:

    rints :: IO [Int]
    rints = do
      p <- uniformRM (1 :: Int, 100) globalStdGen
      let l = if 10 < p then 3 else 4
      ns <- shuffle $ [0..2] ++ [4..10]
      includeThree <- uniformM globalStdGen
      if includeThree
        then do
          let ns' = take (l - 1) ns
          shuffle $ 3 : ns'
        else
          return $ take l ns
    

    While I feel that there's room for improvement, it passes all tests and actually produces random numbers:

    ghci> rints
    [5,2,1]
    ghci> rints
    [9,2,10]
    ghci> rints
    [8,1,3]
    ghci> rints
    [0,9,8]
    ghci> rints
    [0,10,3,6]
    

    This example used QuickCheck with Haskell, but most of the ideas translate to other languages. QuickCheck's cover function may be an exception to that rule, since I'm not aware that it's been ported to common language implementations, but perhaps I'm just behind the curve.

    In situations where something like cover isn't available, you'd have to write a test that loops through enough randomly generated test cases to verify that the distribution is as required. A little more work, but not impossible.
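
    A minimal sketch of that manual approach, staying in Haskell (the helper name, the sample size, and the tolerance are arbitrary choices for illustration):

    import Control.Monad (replicateM)

    -- Sample the SUT many times and check that the observed share of length-3
    -- lists is close to the required 90%. The margin is deliberately generous
    -- so that the check doesn't become flaky.
    lengthDistributionLooksRight :: IO Bool
    lengthDistributionLooksRight = do
      samples <- replicateM 1000 rints
      let threes = length (filter ((== 3) . length) samples)
          ratio  = fromIntegral threes / 1000 :: Double
      return (abs (ratio - 0.9) < 0.05)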


    Since Nikos Baxevanis asked, here's the shuffle implementation:

    import Control.Monad (forM)
    import Data.Array.IO (IOArray, newListArray, readArray, writeArray)

    -- Fisher-Yates-style shuffle over a mutable array: the i-th output element
    -- is picked uniformly from positions i..l, and the element that was at
    -- position i is moved into the vacated slot.
    shuffle :: [a] -> IO [a]
    shuffle xs = do
      ar <- newArray l xs
      forM [1..l] $ \i -> do
          j <- uniformRM (i, l) globalStdGen
          vi <- readArray ar i
          vj <- readArray ar j
          writeArray ar j vi
          return vj
      where
        l = length xs
        newArray :: Int -> [a] -> IO (IOArray Int a)
        newArray n = newListArray (1, n)
    

    I lifted it from https://wiki.haskell.org/Random_shuffle and perhaps edited a bit.
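
    If you want a sanity check on shuffle itself, a small property you could add (I didn't include it above) is that shuffling only reorders the elements - it never adds or drops any:

    import Data.List (sort)

    prop_shufflePreservesElements :: [Int] -> Property
    prop_shufflePreservesElements xs = ioProperty $ do
      shuffled <- shuffle xs
      return $ sort shuffled === sort xs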