Adjust items chance to be selected from a list

I have a list of items. When I create the list each item has equal chance to be selected. But as an item is selected its chance goes down while the others chance goes up. If a new item is added during the process, it should have the highest chance to be selected with its chances falling off as it is selected. I am looking for a good algorithm that can accomplish this is C#.

Generalizaed idea: I have 5 items, over time all 5 items will be selected 20% of time time. I am trying to keep the selections a close to that 20% as possible, cutting down on outlyers. If one exists it will be selected more/less to bring it back in line.

Solution

Here we will engineer a random number generator that has a distribution that favors low values. You can use it to prefer items at the beginning of a list. To decrease the odds of something being selected, move that item down the list. You have a few options for how you want to move the item down the list. Lets review the random variable transformation first.

By applying the following function to a uniform random variable between 0 and 1:

index = Int(l*(1-r^(0.5)) # l=length, r=uniform random var between 0 and 1

You get a cool distribution that drastically reduces the odds of a larger index

p(0)=0.09751
p(1)=0.09246
p(2)=0.08769
p(3)=0.08211
p(4)=0.07636
p(5)=0.07325
p(6)=0.06772
p(7)=0.06309
p(8)=0.05813
p(9)=0.05274
p(10)=0.04808
p(11)=0.04205
p(12)=0.03691
p(13)=0.03268
p(14)=0.02708
p(15)=0.02292
p(16)=0.01727
p(17)=0.01211
p(18)=0.00736
p(19)=0.00249

Here is the distribution for a list of size 2

0.75139
0.24862

Size 3

0.55699
0.33306
0.10996

Size 4

Now lets discuss the two options for moving the items down the list. I tested two:

ToEnd - Move most recently selected item to end of list
Sort - Keep an associated array of the number of times each item has been selected and sort the list from least to most selected.

I created a simulation to pick from a list and examine the standard deviation of the count that each item was selected. The lower the standard deviation the better. For example, 1 simulation for a list of 10 items where 50 selections where made created the spread:

{"a"=>5, "b"=>5, "c"=>6, "d"=>5, "e"=>4, "f"=>4, "g"=>5, "h"=>5, "i"=>6, "j"=>5}

The Standard Devation for this simulation was

0.63

With the ability to run a simulation, I then computed some meta-statistics by running the simulation 500 times and providing the average standard deviation for each method: ToEnd and Sort. I expected for the standard deviation to be high for low # of picks, but in fact for the ToEnd algorithm the Standard Deviation increased with the number of picks. The sort method fixed this.

Testing ["a", "b", "c", "d", "e"]
-------------------------
Picks    ToEnd (StdDev) Sort (StdDev)
5       0.59        0.57
10      0.76        0.68
15      0.93        0.74
20      1.04        0.74
25      1.20        0.73
30      1.28        0.73
35      1.34        0.74
40      1.50        0.75
45      1.53        0.75
45      1.56        0.77
80      2.04        0.79
125     2.52        0.74
180     3.11        0.77
245     3.53        0.79
320     4.05        0.75
405     4.52        0.76
500     5.00        0.78

Here are some test results for a larger set.

Testing ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
-------------------------
Picks  ToEnd (StdDev)  Sort (StdDev)
10          0.68        0.65
20          0.87        0.77
30          1.04        0.80
40          1.18        0.82
50          1.30        0.85
60          1.43        0.84
70          1.57        0.87
80          1.65        0.88
90          1.73        0.87
90          1.71        0.87
160         2.30        0.89
250         2.83        0.88
360         3.44        0.88
490         3.89        0.88
640         4.54        0.87
810         5.12        0.88
1000        5.66        0.85

With a good test framework, I couldn't resist trying a different random number transformation. My assumption was if I take the cube root instead of square root of x that my standard deviation would decrease. It did in fact, but I was worried that this would decrease the randomness. Here you can observe a few simulations when the formula is changed to

index = Int(l*(1-r^(0.33)) # l=length, r=uniform random var between 0 and 1

Now inspect the actual picks. As I thought, its very weighted to the beginning of the list at first. If you want to weight this heavily, you should probably randomize your list before you start.

StdDev = 0.632455532033676
{"a"=>10, "b"=>10, "c"=>11, "d"=>9, "e"=>10}
a d e c b c e b a d b e c a d d e b a e e c c b d a d c b c e b a a d d b e a e a b c b d c a c e c

StdDev = 0.0
{"a"=>10, "b"=>10, "c"=>10, "d"=>10, "e"=>10}
b c d a a d b c b a d e c d e c b e b a e e d c c a b a d a e e b d b a e c c e b a c c d d d a b e

StdDev = 0.632455532033676
{"a"=>9, "b"=>10, "c"=>10, "d"=>10, "e"=>11}
b d a e b c a d c e e b a c a d d c b c e d a e b b a c d c d a e a e e b d c b e a b c b c d d e e

StdDev = 0.0
{"a"=>10, "b"=>10, "c"=>10, "d"=>10, "e"=>10}
a b e d c e a b d c b c c a d b e e b e a d d c e a d b b c c a a a b e d d e c c a b b e a c d e d

StdDev = 0.632455532033676
{"a"=>11, "b"=>10, "c"=>9, "d"=>10, "e"=>10}
b a d c d c a e b e a e b c d b c a a d e e d c d e c b a b b e d c d b c e a a a d b c e b e a d a