indexing cassandra primary-key database-partitioning

How do partition keys work?

I am new to Cassandra and I read that the primary key is the same thing as the partition key.

My question is simple, in this case:

CREATE TABLE users (
  user_name varchar PRIMARY KEY,
  password varchar,
  gender varchar,
  session_token varchar,
  state varchar,
  birth_year bigint
);

As the partition key is responsible for data distribution across your nodes, how will the data be distributed by username in this case?

Solution

Actually, the PRIMARY KEY is not the same as the partition key. The partition key is a part of the PRIMARY KEY, and it is the part that determines how a row is distributed across the cluster. In your table the PRIMARY KEY has only one column, so user_name is the partition key.

how will the data be distributed by username in this case?

If I CREATE your table, insert some values, and query it, I can get a bit of a window into the distribution process by using the token function in my SELECT:

> SELECT token(user_name), user_name FROM user2;

 system.token(user_name) | user_name
-------------------------+-----------
    -5077180869401877077 |   Patdard
    -4874582970682694928 |      Robo
     4639906948852899531 |      Bill
     4645660266327417866 |       Bob
     4877648712764681009 | Valentina
     5726383012007749221 |   Helcine
     7724711996172375448 |  Jebediah

(7 rows)

Let's assume that I have 5 nodes. In Cassandra, each node is responsible for a primary token range. Let's assume the following ranges (the first one wraps around the end of the ring):

1)  5534023222112865485 to -9223372036854775808
2) -9223372036854775807 to -5534023222112865485
3) -5534023222112865484 to -1844674407370955162
4) -1844674407370955161 to  1844674407370955161
5)  1844674407370955162 to  5534023222112865484

Note: Ranges computed by running:

python3 -c 'print([str(((2**64 // 5) * i) - 2**63) for i in range(5)])'

The ranges are also depicted this way in MVP Robbie Strickland's Cassandra High Availability.
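The one-liner only prints the boundary tokens. As a rough sketch, the same arithmetic can be expanded into the five inclusive ranges listed earlier. The function name token_ranges is my own, and I am assuming one evenly spaced token per node (no vnodes) over the signed 64-bit token space:

```python
MIN_TOKEN = -2**63  # smallest token in the signed 64-bit token space

def token_ranges(node_count):
    """Return one inclusive (start, end) token range per node, evenly spaced.

    Hypothetical helper for illustration only. The first range wraps around
    the ring: it starts just past the last boundary and ends at MIN_TOKEN.
    """
    step = 2**64 // node_count
    # Same arithmetic as the one-liner: the boundary token assigned to each node.
    boundaries = [step * i + MIN_TOKEN for i in range(node_count)]
    ranges = [(boundaries[-1] + 1, boundaries[0])]  # wrapping range for node 1
    for previous, current in zip(boundaries, boundaries[1:]):
        ranges.append((previous + 1, current))
    return ranges

for node, (start, end) in enumerate(token_ranges(5), start=1):
    print(f"{node}) {start} to {end}")
```

Each range starts one token after the previous node's boundary, so no token belongs to two nodes.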

Cassandra takes the hashed token value of the partition key (user_name in this case) and uses that to determine which node the row should be distributed to. Given the hashed token values above and the ranges that I have listed out, these are the nodes that each user name should go to:

Node 1: Helcine, Jebediah
Node 3: Patdard, Robo
Node 5: Bill, Bob, Valentina
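To double-check that mapping, here is a minimal sketch of the lookup. It is not how Cassandra implements it internally; it simply searches the assumed 5-node boundaries for the first one at or above each token, using the token values from my SELECT output. The names BOUNDARIES, TOKENS, and owning_node are made up for this example:

```python
from bisect import bisect_left

# Upper boundary token of each node's range, in ring order (nodes 1 to 5).
# These are the assumed boundaries for the hypothetical 5-node cluster.
BOUNDARIES = [
    -9223372036854775808,  # node 1 (its range wraps around the ring)
    -5534023222112865485,  # node 2
    -1844674407370955162,  # node 3
    1844674407370955161,   # node 4
    5534023222112865484,   # node 5
]

# token(user_name) values copied from the query output.
TOKENS = {
    "Patdard": -5077180869401877077,
    "Robo": -4874582970682694928,
    "Bill": 4639906948852899531,
    "Bob": 4645660266327417866,
    "Valentina": 4877648712764681009,
    "Helcine": 5726383012007749221,
    "Jebediah": 7724711996172375448,
}

def owning_node(token):
    """Return the 1-based node whose range contains the token."""
    # The first boundary >= token owns it; a token past the last boundary
    # wraps around to node 1.
    index = bisect_left(BOUNDARIES, token)
    return (index % len(BOUNDARIES)) + 1

for name, token in TOKENS.items():
    print(f"{name}: node {owning_node(token)}")
```

Running it puts Helcine and Jebediah on node 1, Patdard and Robo on node 3, and Bill, Bob, and Valentina on node 5, matching the list above.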

Depending on your replication factor (RF), Cassandra may also place additional replicas of each row on other nodes.
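As an illustration of replica placement, assume SimpleStrategy, which puts the additional replicas on the next nodes clockwise around the ring. NetworkTopologyStrategy also takes racks and data centers into account and is not modeled here. The function replica_nodes and the RF of 3 are just example choices on the same hypothetical 5-node ring:

```python
def replica_nodes(primary_node, replication_factor, node_count=5):
    """Return the 1-based nodes holding a row: the primary first, then the
    next nodes clockwise (SimpleStrategy-style placement, assumed here)."""
    # A cluster cannot hold more replicas of a row than it has nodes.
    copies = min(replication_factor, node_count)
    return [((primary_node - 1 + offset) % node_count) + 1
            for offset in range(copies)]

# With RF = 3, a row owned by node 5 (for example, Bob) wraps around the ring.
print(replica_nodes(5, 3))  # [5, 1, 2]
print(replica_nodes(3, 3))  # [3, 4, 5]
```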