Search code examples
nosqlcassandracqlcql3pycassa

Cassandra data modelling - dynamic columns with datetimes and filtering on UTF8Type


I'm looking to create a table to store information containing details on the URL parameters of visitors to a bunch of websites. I'm getting a dictionary of parameters and values per user_id-session_id pair for each site, each with a timestamp.

In summary: Each site has Users (who have sessions) that have parameters that have values at times.

I'm going to want to query this data like so...

For each website, given a daterange, get the customer_id of the customers with a certain parameter value.

p.s. I don't readily have a list of user id/session id so I can't use those as row keys without doing a 'get_range()' query. Which may be expensive.

Attempted solutions/thoughts...

I figured that dynamic columns would be a good idea here as we don't really know all the different sets of parameters and I'd rather not make an attempt at hard coding them in. That gives me the name of the parameter as the column header and the value of the cells as the parameter value. That's all well and good but it leaves me with composite row keys of (site, date, customer_id, session_id).

I'm all for big composite row keys but I don't think I (would love to be corrected) could query the data to get ranges of dates in this way. If so, how can I do that please?

Also thought about having site-datetime composite column headers as that allows me to filter for those properties nicely. Then I could have the row keys as (param - param value) composite keys and build a set of user_ids as values. Actually, that's sounding like my best bet at the moment I think, but it still seems fairly messy.

If anybody has got any ideas, I'd love to hear them!

Thanks,

Matt

p.p.s I'm using Pycassa, CQL3 and Cassandra 1.2 so have all the tools available to me that those platforms offer


Solution

  • @Kali

    All problem has proper solution

    For query, we have:

    • site as value

    We would like to search by date range

    CREATE TABLE my_index
    (
       site text, // row key, also known as partition key
       date timestamp, // search by date range
       user_id bigint, // user id
       session_id uuid, // session id
       parameters map<text,text> // pairs of param key/param value
       PRIMARY KEY (site,date,user_id,session_id)
    )
    

    For each site, we have one row because site = row key

    The column name is a composite of (date,*user_id*,*session_id*). Date is the first component so that you get query by date.

    *user_id* and *session_id* are there to guarantee UNICITY, meaning that for each timestamp, you can store parameters for more than 1 couple of user_id/session_id

    Now, to store parameters, use the MAP feature introduced by CQL3. Parameter names are keys of the map and can be stored as text. For parameter values, since they can be of any type (long, int, string, date ...), it is better to JSON-serialize them as text so you can store them together in the map.

    This solution require a little serialize/deserialize before persisting/readign from Cassandra

    Example of query:

    Get all users,session and parameters for website www.stackoverflow.com between '2013-09-03' and '2013-09-04'

    SELECT user_id,session_id,parameters FROM my_index 
    WHERE site = 'www.stackoverflow.com' AND date>=1378166400 
    AND date<=1378252800
    

    Get all parameters for website www.stackoverflow.com at date '2013-09-03' and user_id =10

    SELECT parameters FROM my_index 
    WHERE site = 'www.stackoverflow.com' 
    AND date=1378166400 AND user_id=10