A couple months ago, we were using UUIDs to generate random string IDs that needed to be unique across the board. I then changed the algorithm in order to save some data and index space in our database. I tested a few ways to generate unique string IDs, and I decided to use this function:
function generateToken($length) {
$characters = '0123456789abcdefghijklmnopqrstuvwxyz';
$max = strlen($characters) - 1;
$token = '';
for ($i = 0; $i < $length; $i++) {
$token .= $characters[mt_rand(0, $max)];
}
return $token;
}
I'm using this function to generate IDs that are 20 characters long using digits and letters, or you could say these IDs are numbers in base 36. The probability of any 2 IDs colliding should be 1/36^20, but due to the birthday paradox, it can be expected for a collision to occur after about 36^10 records - that's 3.6 quadrillion records. Yet, just a few hours ago a collision occurred, when there were only 5.3 million existing records in the database. Am I extremely unlucky, or my ID-generating function is flawed with respect to randomness? I know mt_rand() isn't truly random, but it is random enough, isn't it?
I would've written a loop that checks if the generated ID is unique and generates a new one if it isn't, but I thought that the chance of getting a collision was so small that the performance cost of such a loop wasn't worth it. I will now include such a loop in the code, but I'm still interested in perfecting the ID generation function if it is indeed flawed.
The implementation of mt_rand()
in PHP is rather fluid, so it may differ from one version to the next. However, here are some excerpts from the code used in PHP version 5:
/* MT Rand */
#define PHP_MT_RAND_MAX ((long) (0x7FFFFFFF)) /* (1<<31) - 1 */
#ifdef PHP_WIN32
#define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * GetCurrentProcessId())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
#else
#define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * getpid())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
#endif
PHPAPI void php_srand(long seed TSRMLS_DC);
PHPAPI long php_rand(TSRMLS_D);
PHPAPI void php_mt_srand(php_uint32 seed TSRMLS_DC);
PHPAPI php_uint32 php_mt_rand(TSRMLS_D);
PHP_FUNCTION(mt_rand)
{
long min;
long max;
long number;
int argc = ZEND_NUM_ARGS();
if (argc != 0) {
if (zend_parse_parameters(argc TSRMLS_CC, "ll", &min, &max) == FAILURE) {
return;
} else if (max < min) {
php_error_docref(NULL TSRMLS_CC, E_WARNING, "max(%ld) is smaller than min(%ld)", max, min);
RETURN_FALSE;
}
}
if (!BG(mt_rand_is_seeded)) {
php_mt_srand(GENERATE_SEED() TSRMLS_CC);
}
From the last three lines above, you can see that mt_rand()
is automatically seeded the first time it is called. However, the php_mt_srand()
function takes an argument of type php_uint32
. This means there are only 232 possible seeded states for mt_rand()
. So if your script runs roughly 216 times, it is quite likely that mt_rand()
will produce the exact same sequence of random numbers.
As suggested by rossum, it would be a much better idea to apply AES encryption to an incrementing 128-bit value. If you base64-encode the encrypted results and discard the trailing ==
, then the resulting strings will only be 22 characters long.
I left the following script running while I was out this afternoon:
for i in $(seq 1 100000) ; do
php -r 'for ($n=0; $n<32; $n++) echo chr(mt_rand(97,122)); echo chr(10);' >>out
done &
As expected, the first collision occurred after about 216 iterations (which is nowhere near 2616):
$ sort <out | uniq -d
vnexqclzkaluntglgadgwzjnjfsvqfhp
$ grep -n vnexqclzkaluntglgadgwzjnjfsvqfhp out
34417:vnexqclzkaluntglgadgwzjnjfsvqfhp
52159:vnexqclzkaluntglgadgwzjnjfsvqfhp