Hash collisions for golang built-in map and string keys?


I wrote this function to generate unique IDs for my test cases:

func uuid(t *testing.T) string {
    uidCounterLock.Lock()
    defer uidCounterLock.Unlock()

    uidCounter++
    //return "[" + t.Name() + "|" + strconv.FormatInt(uidCounter, 10) + "]"
    return "[" + t.Name() + "|" + string(uidCounter) + "]"
}

var uidCounter int64 = 1
var uidCounterLock sync.Mutex

To test it, I generate a bunch of values in different goroutines and send them over a channel to the main goroutine, which counts each value in a map[string]int by doing seen[v] = seen[v] + 1. There is no concurrent access to this map; only the main goroutine touches it.

var seen = make(map[string]int)
for v := range ch {
    seen[v] = seen[v] + 1
    if count := seen[v]; count > 1 {
        fmt.Printf("Generated the same uuid %d times: %#v\n", count, v)
    }
}

When I just convert uidCounter to a string with string(uidCounter), I get a ton of collisions on a single key. When I use strconv.FormatInt, I get no collisions at all.

When I say a ton, I mean I just got 1115919 collisions for the value [TestUuidIsUnique|�] out of 2227980 generated values, i.e. 50% of the values collide on the same key. The values are not equal. I do always get the same number of collisions for the same source code, so at least it's somewhat deterministic, i.e. probably not related to race conditions.

I'm not surprised integer overflow in a rune would be an issue, but I'm nowhere near 2^31, and that wouldn't explain why the map thinks 50% of the values have the same key. Also, I wouldn't expect a hash collision to impact correctness, just performance, since I can iterate over the keys in a map, so the values are stored there somewhere.

In the output, all runes printed are 0xEFBFBD. That is the same number of bits as the highest valid Unicode code point, but that doesn't really match either.

Generated the same uuid 2 times: "[TestUuidIsUnique|�]"
Generated the same uuid 3 times: "[TestUuidIsUnique|�]"
Generated the same uuid 4 times: "[TestUuidIsUnique|�]"
Generated the same uuid 5 times: "[TestUuidIsUnique|�]"
...
Generated the same uuid 2047 times: "[TestUuidIsUnique|�]"
Generated the same uuid 2048 times: "[TestUuidIsUnique|�]"
Generated the same uuid 2049 times: "[TestUuidIsUnique|�]"
...

What's going on here? Did the Go authors assume that hash(a) == hash(b) implies a == b for strings? Or am I just missing something silly? go test -race isn't complaining either.

I'm on macOS 10.13.2, and go version go1.9.2 darwin/amd64.


Solution

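  • string(uidCounter) does not format the number as decimal text. It interprets the integer as a Unicode code point and returns that code point's UTF-8 encoding. For any value that is not a valid code point (the surrogate range 0xD800 to 0xDFFF, or anything above 0x10FFFF), the conversion returns a string containing the Unicode replacement character "�" (U+FFFD, encoded in UTF-8 as 0xEF 0xBF 0xBD). Every such counter value therefore produces the identical string, so the map is seeing equal keys, not hash collisions.

    Use the strconv package (for example strconv.FormatInt) to convert an integer to text.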
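The difference between the two conversions can be seen in a small standalone program. This is only a minimal sketch: the helper names runeString and decimalString are hypothetical and not part of the original question, and the conversion is written as string(rune(n)) because newer Go toolchains flag a direct string(int64) conversion in go vet; the result for these inputs is the same.

```go
package main

import (
	"fmt"
	"strconv"
)

// runeString (hypothetical helper) mimics string(uidCounter) from the
// question: n is interpreted as a Unicode code point, not as a number
// to be printed in decimal.
func runeString(n int64) string {
	return string(rune(n))
}

// decimalString (hypothetical helper) formats n as base-10 text, which
// is what the commented-out line in the question does.
func decimalString(n int64) string {
	return strconv.FormatInt(n, 10)
}

func main() {
	// 65 is a valid code point; 0xD800 is a surrogate; 0x110000 and
	// 0x110001 lie above the highest valid code point, 0x10FFFF.
	for _, n := range []int64{65, 0xD800, 0x110000, 0x110001} {
		s := runeString(n)
		// "% x" prints the UTF-8 bytes of s, "%+q" prints it with
		// non-ASCII characters escaped.
		fmt.Printf("n=%d rune conversion=%+q bytes=[% x] FormatInt=%q\n",
			n, s, s, decimalString(n))
	}

	// Distinct counter values above 0x10FFFF yield equal strings, so a
	// map[string]int treats them as one key.
	fmt.Println(runeString(0x110000) == runeString(0x110001))
}
```

Running this should show the invalid inputs all mapping to the same three-byte string, while strconv.FormatInt keeps every value distinct.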