Search code examples
cassandracassandra-2.1cassandra-cli

cassandra - simple/basic data modeling to retrieve all employees


Creating the following employee column family in Cassandra

Case 1:

CREATE TABLE employee (
    name  text,
    designation text,
    gender text,
    created_by text,
    created_date timestamp,
    modified_by text,
    modified_date timestamp,
    PRIMARY KEY (name)
);

From UI, if i wanted to get all employee, it is not possible to retrieve. is it true?

select * from employee; //not possible as it is partitioned by name

Case 2: I was told to do this way to retrieve all employees. We need to design this with a static key, to retrieve all the employees.

CREATE TABLE employee (
    static_name text,
    name  text,
    designation text,
    gender text,
    created_by text,
    created_date timestamp,
    modified_by text,
    modified_date timestamp,
    PRIMARY KEY (static_name,name)
);

static_name i.e.) "EMPLOYEE" will be the partition key and name will the clustering key. Primary key, combination of both static_name and name

static_name -> every time you add the employee , insert with the static value i.e) EMPLOYEE

now, you will be able to do "select all employees query"

//this will return you all the employees select * from employee where static_name='EMPLOYEE';

is this true? can't we use case 1 to return all the employees?


Solution

  • Both approaches are o.k. with some catches

    Approach 1:

    When you say UI I guess you mean to use simple select * ... it's correct that this won't really work out of the box if you want to get every single one of them out. Especially if the data set is big. You could use pagination on a driver (I'm not 100% sure since I hadn't had a case in a while to use it) but when I needed to jump over all the partition I would use the token function i.e.:

    select token(name), name from employee limit 1;
    
     system.token(name)   | name
    ----------------------+------
     -8839064797231613815 |    a
    

    now you use the result of the token and put it into next query. This would have to be done by your program. After it would fetch all the elements that are greater than ... you would also need to start for all lower than the -8839064797231613815.

    select token(name), name from employee where token(name) > -8839064797231613815 limit 1;
    
     system.token(name)   | name
    ----------------------+------
     -8198557465434950441 |    c
    

    and then I would wrap this into a loop until I would fetch all the elements out. (I think this is also how spark cassandra does it when retrieving wide rows out from a cluster).

    Disadvantage of this model is that it's really bad because it has to go all over the cluster and is more or less to be used in analytical work loads. Since you mentioned UI, It would take the user too long to get the result, so I would advise not to use approach 1 in UI related stuff.

    Approach 2.

    Disadvantage of the second one is that it would be what is called a hot row. Meaning every update would go to a single partition and this is most of the time bad model.

    The advantage is that you could easily paginate over the one partition and get your data out by pagination functions built into the driver.

    This would how ever behave just fine if you have moderate load (tens or low hundreds updates per second) and relatively low number of users, let's say for 100 000 this would work just fine. If your numbers are greater you have to somehow split up the users into multiple partitions so that the "load" gets distributed more evenly.

    One possibility is to include letter of alphabet into "EMPLOYEE" ... so you would have "EMPLOYE_A", "EMPLOYEE_B" etc ... this would work relatively well. Not ideal again because of the lexicographical distribution and some partitions may get relatively larger amounts of that which is also not ideal.

    One approach would be to create some artificial columns, let's say by design you say there are 10 buckets and when you insert into "EMPLOYEE" partition you just add (random bucket to the static prefix) "EMPLOYEE_1" and so on ... but when retrieving you go over specific partition until you exhaust the result.