Search code examples
randomtokenuuidbirthday-paradox

Random token generation - a supposedly unlikely collision occurred


A couple months ago, we were using UUIDs to generate random string IDs that needed to be unique across the board. I then changed the algorithm in order to save some data and index space in our database. I tested a few ways to generate unique string IDs, and I decided to use this function:

function generateToken($length) {
    $characters = '0123456789abcdefghijklmnopqrstuvwxyz';
    $max = strlen($characters) - 1;

    $token = '';
    for ($i = 0; $i < $length; $i++) {
        $token .= $characters[mt_rand(0, $max)];
    }

    return $token;
}

I'm using this function to generate IDs that are 20 characters long using digits and letters, or you could say these IDs are numbers in base 36. The probability of any 2 IDs colliding should be 1/36^20, but due to the birthday paradox, it can be expected for a collision to occur after about 36^10 records - that's 3.6 quadrillion records. Yet, just a few hours ago a collision occurred, when there were only 5.3 million existing records in the database. Am I extremely unlucky, or my ID-generating function is flawed with respect to randomness? I know mt_rand() isn't truly random, but it is random enough, isn't it?

I would've written a loop that checks if the generated ID is unique and generates a new one if it isn't, but I thought that the chance of getting a collision was so small that the performance cost of such a loop wasn't worth it. I will now include such a loop in the code, but I'm still interested in perfecting the ID generation function if it is indeed flawed.


Solution

  • The implementation of mt_rand() in PHP is rather fluid, so it may differ from one version to the next. However, here are some excerpts from the code used in PHP version 5:

    php_rand.h:

    /* MT Rand */
    #define PHP_MT_RAND_MAX ((long) (0x7FFFFFFF)) /* (1<<31) - 1 */ 
    
    #ifdef PHP_WIN32
    #define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * GetCurrentProcessId())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
    #else
    #define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * getpid())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
    #endif
    
    PHPAPI void php_srand(long seed TSRMLS_DC);
    PHPAPI long php_rand(TSRMLS_D);
    PHPAPI void php_mt_srand(php_uint32 seed TSRMLS_DC);
    PHPAPI php_uint32 php_mt_rand(TSRMLS_D);
    

    rand.c:

    PHP_FUNCTION(mt_rand)
    {
        long min;
        long max;
        long number;
        int  argc = ZEND_NUM_ARGS();
    
        if (argc != 0) {
            if (zend_parse_parameters(argc TSRMLS_CC, "ll", &min, &max) == FAILURE) {
                return;
            } else if (max < min) {
                php_error_docref(NULL TSRMLS_CC, E_WARNING, "max(%ld) is smaller than min(%ld)", max, min);
                RETURN_FALSE;
            }
        }
    
        if (!BG(mt_rand_is_seeded)) {
            php_mt_srand(GENERATE_SEED() TSRMLS_CC);
        }
    

    From the last three lines above, you can see that mt_rand() is automatically seeded the first time it is called. However, the php_mt_srand() function takes an argument of type php_uint32. This means there are only 232 possible seeded states for mt_rand(). So if your script runs roughly 216 times, it is quite likely that mt_rand() will produce the exact same sequence of random numbers.

    As suggested by rossum, it would be a much better idea to apply AES encryption to an incrementing 128-bit value. If you base64-encode the encrypted results and discard the trailing ==, then the resulting strings will only be 22 characters long.


    Addendum

    I left the following script running while I was out this afternoon:

    for i in $(seq 1 100000) ; do
      php -r 'for ($n=0; $n<32; $n++) echo chr(mt_rand(97,122)); echo chr(10);' >>out
    done &
    

    As expected, the first collision occurred after about 216 iterations (which is nowhere near 2616):

    $ sort <out | uniq -d
    vnexqclzkaluntglgadgwzjnjfsvqfhp
    
    $ grep -n vnexqclzkaluntglgadgwzjnjfsvqfhp out
    34417:vnexqclzkaluntglgadgwzjnjfsvqfhp
    52159:vnexqclzkaluntglgadgwzjnjfsvqfhp