python random sequence membership secrets

Generating random passphrases from sets of strings with secrets.random is not very random

I have a requirement for a basic passphrase generator that uses set lists of words, one for each position in the passphrase.

 def generate(self) -> str:
        passphrase = "%s-%s-%s-%s-%s%s%s" % (
            self.choose_word(self.verbs),
            self.choose_word(self.determiners),
            self.choose_word(self.adjectives),
            self.choose_word(self.nouns),
            self.choose_word(self.numbers),
            self.choose_word(self.numbers),
            self.choose_word(self.numbers),
        )

The lists chosen from contain 100 adjectives, 9 determiners, 217 nouns, 67 verbs, and digits 1-9 and the choices are made using secrets.choice

    def choose_word(cls, word_list: List[str]) -> str:
        return secrets.choice(word_list)

In theory I thought this would give me a shade under 13 billion unique passwords. However I have written a test case that generates 10,000 passphrases and checks that they are all unique through a membership check on a sequence of the generated passphrases

def test_can_generate_10000_passphrases_without_collision(passphrase: Passphrase):
    generated_passphrases = []
    for i in range(10000):
        generated_passphrase = passphrase.generate()
        assert generated_passphrase is not None and len(generated_passphrase) > 10
        assert generated_passphrase not in generated_passphrases
        generated_passphrases.append(generated_passphrase)

    assert len(generated_passphrases) == 10000

However, the test does not reflect the probability of duplicates I expected. Using pytest-repeat I setup this test to run 10,000 times and it failed/generated duplicate passwords 24 times in 4145 runs before I killed the process. In each case the output of the test is truncated but shows that a chosen passphrase was found 'in' the set of generated phrases, and the phrase is different each time.

I don't really have a specific question here, it just seems like I'm misunderstanding something, either my probability calculations are wrong and I need to add more words to the lists or something about the test in sequence check is doing a looser match then I expected.

I switched from random.choices to secrets.choices, I tried re-instantiating the password generator class between runs, tried adding checks that the password was non-empty (because empty strings always match) and also tried running in and out of a docker container thinking there might be something messing up the randomness.

Solution

Nature of the Problem

Let's start by asserting that there is a 1-to-1 mapping between k-dimensional list indices for lists of length ℓ₁,...,ℓ_k and integers in the range [0,...,Π(ℓ₁,...,ℓ_k)-1], where Π denotes the product of the set of values.

The following python class shows an implementation of the math which does that mapping.

class Mapping:

  def __init__(self, dimensions):
    self.dimensions = dimensions
    self.n_dimensions = len(dimensions)
    self.cumulative = dimensions.copy()
    self.cumulative.append(1)
    for i in range(1, self.n_dimensions):
      idx = self.n_dimensions - i
      self.cumulative[idx] *= self.cumulative[idx+1]
    self.capacity = self.cumulative.pop(0)
    self.capacity *= self.cumulative[0]

  def indices_to_int(self, indices):
    result = 0
    for i in range(self.n_dimensions):
      result += indices[i] * self.cumulative[i]
    return result

  def int_to_indices(self, index):
    indices = [None] * self.n_dimensions
    for i in range(self.n_dimensions):
      indices[i] = index // self.cumulative[i]
      index -= indices[i] * self.cumulative[i]
    return indices

Using this with a small set of indices shows the 1-to-1 mapping pretty clearly. This code:

list_lengths = [2,3,4]
map = Mapping(list_lengths)

int_set = range(map.capacity)
for number in int_set:
  index_list = map.int_to_indices(number)
  print(number, index_list, map.indices_to_int(index_list))

produces the results given below. The output consists of an integer in the left column, which maps to the list of index values in the center. That list is subsequently mapped back to the original integer to demonstrate the 1-to-1 nature of the mapping.

0 [0, 0, 0] 0
1 [0, 0, 1] 1
2 [0, 0, 2] 2
3 [0, 0, 3] 3
4 [0, 1, 0] 4
5 [0, 1, 1] 5
6 [0, 1, 2] 6
7 [0, 1, 3] 7
8 [0, 2, 0] 8
9 [0, 2, 1] 9
10 [0, 2, 2] 10
11 [0, 2, 3] 11
12 [1, 0, 0] 12
13 [1, 0, 1] 13
14 [1, 0, 2] 14
15 [1, 0, 3] 15
16 [1, 1, 0] 16
17 [1, 1, 1] 17
18 [1, 1, 2] 18
19 [1, 1, 3] 19
20 [1, 2, 0] 20
21 [1, 2, 1] 21
22 [1, 2, 2] 22
23 [1, 2, 3] 23

This means that your problem of trying to generate unique combinations of elements from a set of lists is equivalent to generating a set of unique integer values in a well-defined range. With list lengths of [2,3,4], that's range(2*3*4). For your actual problem, the applicable range is range(100*9*217*67*10*10*10), i.e., range(13_085_100_000).

You might think that if you take a sample of size 10_000 from a pool of 13 billion integers, the chance of getting any duplicates would be close to zero. The birthday problem, also known as the birthday paradox, tells us otherwise. When you sample values independently, each additional sample has the task of avoiding all of the values that have already been sampled. The probabilities of avoidance start off high, but they are less than 1, diminish as the quantity of values sampled increases, and get multiplied (due to independent sampling), so the probability of all of them simultaneously missing each other diminishes surprisingly rapidly towards zero. It turns out that with 365 days in a year (ignoring leap years), the probability of getting one or more duplicate birthdays exceeds 1/2 by the time you introduce the 23rd person to the group. By the time you get to 57 people, the probability of having two or more who share birthdays exceeds 0.99.

The pool size of 13_085_100_000 is much larger than 365, but the logic works the same way. The following program shows that with a sample size of 10_000 integer values, the analytically derived probability of having one or more duplicate values is about 0.00381. We would expect 38 out of 10_000 such experiments to have at least one duplicate. With a sample size of 135_000 the probability of duplicates exceeds 0.5.

import fractions

pool_size = fractions.Fraction(13_085_100_000)  # use big-integer rational arithmetic
samp_size = 10_000
p_no_duplicates = fractions.Fraction(1)  # no duplicates when one item in the room...
numerator = fractions.Fraction(pool_size + 1)

# ...so start with the second item.
for i in range(2, samp_size): # i is number of items currently in the room
  p_no_duplicates *= fractions.Fraction(numerator - i) / pool_size
print("For " + str(samp_size) + " items and a pool size of " +
  str(pool_size) + ", P(Duplicate) = " + str(1.0 - p_no_duplicates))

# For 10000 items and a pool size of 13085100000, P(Duplicate) = 0.0038127078842439266

One Possible Solution

A possible solution is to use python's random.sample(), which does sampling without replacement. That means that once a value is selected it gets removed from the pool of remaining candidates. (That changes the probabilities of all the remaining values in the pool, so this is not independent sampling.) No duplicates will occur until the entire pool has been sampled. Using random.sample() will guarantee unique integers which can then be mapped into the unique set of indices from which to construct your set of passphrases.

The following illustrates this using the [2,3,4] indices from the earlier example:

import random

# list_lengths = [100,9,217,67,10,10,10]
list_lengths = [2,3,4]
map = Mapping(list_lengths)

int_set = random.sample(range(map.capacity), map.capacity)
for number in int_set:
  index_list = map.int_to_indices(number)
  # Use index_list to generate a unique password 
  # from the lists of component elements.

This will produce all sets of indices uniquely, but in a random order. Toggle the comments on the list_lengths assignment lines and set the sample size to 10_000 to obtain a randomized subsample of unique indices in index_list.