Search code examples
gorandom

Golang pseudo random string (lots of duplicates)


I am attempting to generate a random string of 5 chars. Every implementation I've tried generates duplicates after a very small sample size (10k-40k). While I understand that programmatically its impossible to generate true random strings I was surprised to see duplicates appearing at such a high frequency.

I have tried several implementations (and variations of those) without luck. Each one generates a duplicate after ~40k strings at most. Given that there are 380,204,032 unique combinations in a 5 char string consisting of [A-Z a-z] I was expecting to be able to generate significantly more strings before encountering a duplicate.

After searching around I found a couple of good sources that were the basis for my implementations

The 2nd link in particular caught my eye as the author mentioned using the "crypto/rand" package was better at avoiding duplicates. However it seemed to make little difference in how many strings I was able to generate before encountering a duplicate.

Other variations I tried

  • calling rand.NewSource after each char (instead of each string)
  • using multiple rand.NewSource and using the output of the 1st to seed the 2nd, 3rd, etc.

Can anyone provide some insight into why these random strings aren't so random and some steps I might be able to take to generate at least 1M strings before encountering a duplicate?

I know I could use a 3rd party library but I would like to avoid that.

func Test_RandomString(t *testing.T) {

    tests := map[string]struct {
        Length       int
        UniuqeValues int
    }{
        "5 x 1,000,000": {
            Length:       5,
            UniuqeValues: 1_000_000,
        },
    }

    for name, test := range tests {
        t.Run(name, func(t *testing.T) {
            actual := utils.RandomString(test.Length)
            assert.Equal(t, test.Length, len(actual))

            values := make(map[string]struct{})
            for count := 0; count < test.UniuqeValues; count++ {
                value := utils.RandomString(test.Length)

                _, found := values[value]
                if found {
                    t.Fatalf("duplicate value found after %v: %v", count, value)
                }

                values[value] = struct{}{}
            }

        })
    }
}
func RandomString_CryptoRand(length int) string {

    const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    var randomStr []byte = make([]byte, length)

    for i := 0; i < length; i++ {

        idx, _ := rand.Int(rand.Reader, big.NewInt(int64(len(letters))))
        randomStr[i] = letters[idx.Int64()]
    }
    return string(randomStr)
}
func RandomString_MathRand(length int) string {
    var src = rand.NewSource(time.Now().UnixNano())
    const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    sb := strings.Builder{}
    sb.Grow(length)

    for idx := 0; idx < length; idx++ {
        sb.WriteByte(letterBytes[int64(src.Int63())%int64(len(letterBytes))])
    }

    return sb.String()
}

Solution

  • This is the birthday paradox. You mistakenly assumed that the probability of duplicates is linearly proportional to the pool-size of 525, while in reality the relationship is proportional to the square root. Sqrt(380204032) is 19498 and change, which is pretty much spot-on for what you observed.