Search code examples
pythonrandommd5uuid

Can hash algorithm such as MD5/SHA-1 generate an ID with less probability of collision than pure random number?


I have an application which uses a hash algorithm (currently MD5) to generate a unique ID in a database table. The hash is calculated based on some fields of a row, but nothing checks that calculation, for when those fields are changed later on, the ID of that row doesn't change.

Now I want to change the code to add some new features, while generating a purely random number for the ID could simplify my work a lot (it's a long story to tell why it is much easier for me to generate that ID before I'm able to get all the necessary fields' content for the hash algorithm)

I know that usually the programming language's own random generator generates pseudo random number, but I'm using Python's random.SystemRandom(), which uses operating system's cryptography level 'true' random generator, so I believe it should be the same collision probability comparing with generating the ID with hash algorithm.

Is my understanding correct? If not, why?


Solution

  • Generating X number of bytes of random data gives exactly the same collision probability as using the hash function on some ID's...

    ASSUMING...

    1. The columns you're using the hash function on are themselves unique.
    2. You haven't made mistakes doing #1

    I would recommend using the system's cryptographic random number provider. Because you've probably made mistakes. Here's an easy one:

    Your system: Concatenate column 1 and column 2, and hash the result. You can guarantee you'll never ever do this on those values of column 1 and column 2 ever again. NEVER.

    What about when:

    1. Column 1 = "abc"
    2. Column 2 = "def"

    vs

    1. Column 1 = "ab"
    2. Column 2 = "cdef"

    Those would create the same hash function.

    So who would you trust more to give you random data? Yourself? Or a team of operating system developers including cryptography experts and decades of research and experience? :)

    Go with the system's cryptographic random function.