Migrating from MySQL to Cassandra

Previously I was using the class found here to convert a userID into a short, random-looking string.

From his blog:

Running:

alphaID(9007199254740989);

will return 'PpQXn7COf' and:

alphaID('PpQXn7COf', true);

will return '9007199254740989'
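
For readers without access to that class, here is a minimal sketch of the general idea (reversible base-62 encoding of an integer). It is not the alphaID() implementation from the blog: the function names and the alphabet order are assumptions, so the strings it produces will differ from 'PpQXn7COf'.

```php
<?php
// Hypothetical helpers for illustration; NOT the blog's alphaID() code.
// The alphabet order is an assumption, so outputs differ from alphaID().
const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Encode a non-negative integer as a base-62 string.
function base62Encode(int $n): string
{
    if ($n < 0) {
        throw new InvalidArgumentException('expected a non-negative integer');
    }
    $alphabet = ALPHABET;
    $base = strlen($alphabet);
    $out = '';
    do {
        $out = $alphabet[$n % $base] . $out; // prepend the least significant digit
        $n = intdiv($n, $base);
    } while ($n > 0);
    return $out;
}

// Decode a base-62 string back to the integer it represents.
function base62Decode(string $s): int
{
    if ($s === '') {
        throw new InvalidArgumentException('expected a non-empty string');
    }
    $alphabet = ALPHABET;
    $base = strlen($alphabet);
    $n = 0;
    foreach (str_split($s) as $ch) {
        $pos = strpos($alphabet, $ch);
        if ($pos === false) {
            throw new InvalidArgumentException("invalid character: $ch");
        }
        $n = $n * $base + $pos;
    }
    return $n;
}
```

With this alphabet, an integer as large as 9007199254740989 still fits in 9 characters, which is why the scheme gives short URLs when the input is a small auto-incremented number.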

So the idea was that users could visit www.mysite.com/user/PpQXn7COf, and I would convert that back to a normal integer so I could run this in MySQL:

"Select * from Users where userID=".alphaID('PpQXn7COf', true)

I have just started working with Cassandra and I'm looking for a replacement.

I want URLs like www.mysite.com/user/PpQXn7COf, not like www.mysite.com/user/username1.
The "PpQXn7COf" uuid must be as short as possible.

In the Twissandra example explained here: http://www.rackspace.com/cloud/blog/2010/05/12/cassandra-by-example/

They create a long uuid (I guess it is so long because then it is almost 100 percent certain to be unique).

In MySQL I just had a userID column with auto-increment, so when I used the alphaID() function I always got a very short, random-looking string.

Does anyone have an idea how to solve this as cleanly as possible?

Edit:

It is used for a social media site, so it must be persistent. That's also why I don't want to use usernames or real names in URLs: users can't stay undetected by Google if they need to.

I just got a simple idea; however, I don't know how scalable it is:

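<?php
// createUUID() makes a +- 14 char string with A-Z a-z 0-9 based on micro/milli/nanoseconds
// get_count($uuid) returns how many users already have that uuid
do {
  $uuid = createUUID();
} while (get_count($uuid) > 0); // repeat until the uuid is unique

// insert username, pass, $uuid etc. into cassandra here; $result holds the outcome
if ($result == "1") {
  header('Location: http://www.mysite.com/usercenter');
} else {
  echo "error";
}
?>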

When this gets to the size of, let's say, Twitter or Facebook:

Will it execute in acceptable time?
Will it still generate unique uuids fast enough, so that it doesn't get clogged up if 10000 users/second are registering?

Solution

Auto-increments are not suitable for a robust distributed system. To guarantee that an auto-incremented ID is unique, every node in your system must be available when the ID is assigned.

You can, of course, invent your own unique-ID generator, but you must then ensure that it generates unique IDs anywhere in your infrastructure.

For example, each node can have a file which it simply increments (with suitable locking etc.), but you will also need to ensure that IDs from different nodes don't clash, for instance by including the server ID in the generation algorithm.

This may be operationally nontrivial - your ops engineers will need to ensure that all the servers in the infrastructure are configured correctly with their own ID generators set up so that they don't generate the same ID. However, it's possible.
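
A minimal sketch of such a per-node counter file might look like the following. The function name, the `$serverId` and `$counterFile` parameters, and the "serverId-counter" output format are all assumptions made for illustration; a real deployment would also need to decide how the file is backed up and what happens if it is lost.

```php
<?php
// Hypothetical sketch: $serverId and $counterFile are assumed per-node settings.
function nextLocalId(int $serverId, string $counterFile): string
{
    // 'c+' opens for read/write, creates the file if missing, and does not truncate it.
    $fh = fopen($counterFile, 'c+');
    if ($fh === false) {
        throw new RuntimeException("cannot open $counterFile");
    }
    try {
        // Exclusive lock so concurrent requests on this node never read the same value.
        if (!flock($fh, LOCK_EX)) {
            throw new RuntimeException('cannot lock counter file');
        }
        $current = (int) stream_get_contents($fh); // an empty file counts as 0
        $next = $current + 1;
        ftruncate($fh, 0);
        rewind($fh);
        fwrite($fh, (string) $next);
        fflush($fh);
        flock($fh, LOCK_UN);
    } finally {
        fclose($fh);
    }
    // Prefixing the server ID keeps IDs from different nodes from clashing,
    // provided no two nodes are configured with the same server ID.
    return $serverId . '-' . $next;
}
```

Uniqueness here rests entirely on each node having a distinct server ID, which is the operational burden described above.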

UUIDs are the reasonable alternative, because they are designed to be unique without any coordination between nodes.

A UUID is 128 bits; if we store 6 bits per character (i.e. base64) then that takes 22 characters, which is quite a long URI. If you want it shorter, you will need to generate unique IDs a different way.
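
To check the arithmetic, here is a small sketch that encodes 128 random bits with the URL-safe base64 alphabet. The function name is hypothetical, and note that 16 random bytes are not strictly a standard UUID (which reserves a few bits for version and variant); the length of the result is the same either way.

```php
<?php
// Hypothetical helper: 128 random bits, encoded with URL-safe base64.
function shortRandomId(): string
{
    $bytes = random_bytes(16);      // 16 bytes = 128 bits from the system CSPRNG (PHP 7+)
    $b64 = base64_encode($bytes);   // 24 characters, the last two are '=' padding
    // Swap '+' and '/' for URL-safe characters and drop the padding.
    return rtrim(strtr($b64, '+/', '-_'), '=');
}

echo strlen(shortRandomId()), "\n"; // prints 22
```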

Plus it all depends on "how unique" you actually need your IDs to be. If your IDs can safely be reused after a few months, you can probably do it in < 60 bits (depending also on the number of servers in your infrastructure, and how frequently you need to generate them).

We use

Server ID
Time (granularity = 2 seconds), but wraps after a few months
A per-server counter (which wraps frequently, but not within 2 seconds)

We then stick all the bits together. This generates an ID which is < 64 bits long, but is guaranteed to be unique for the length of time it needs to be (which in our case is only a couple of months).
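
The three components above could be packed together roughly as follows. This is only a sketch: the bit widths, constant names and function name are assumptions chosen for illustration (the widths actually used are not stated here), and it assumes a 64-bit PHP build.

```php
<?php
// All widths below are assumptions for illustration only.
const SERVER_BITS  = 10; // up to 1024 servers
const TIME_BITS    = 24; // 2-second ticks; wraps after 2^24 ticks
const COUNTER_BITS = 16; // per-server counter; wraps after 65536 IDs

// Pack server ID, coarse time and a per-server counter into one integer.
function makeId(int $serverId, int $unixTime, int $counter): int
{
    if ($serverId < 0 || $serverId >= (1 << SERVER_BITS)) {
        throw new InvalidArgumentException('server ID out of range');
    }
    // 2-second granularity; masking makes the value wrap around.
    $tick = intdiv($unixTime, 2) & ((1 << TIME_BITS) - 1);
    // The counter wraps too; it must not wrap within a single 2-second tick.
    $count = $counter & ((1 << COUNTER_BITS) - 1);
    return ($serverId << (TIME_BITS + COUNTER_BITS)) | ($tick << COUNTER_BITS) | $count;
}
```

With these assumed widths the ID is 50 bits long and the time field wraps after 2^24 ticks of 2 seconds, about 388 days. The caller is expected to pass `time()` and a counter that it increments for every ID generated on that server.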

Our algorithm will malfunction and generate a duplicate ID if:

The system clock on one of our nodes goes backwards by the same amount of time in which the counter wraps.
Our operations engineers make a mistake and assign the same server ID to two servers.
Eventually, after about 9 months, when the time component wraps.