Search code examples
filtercassandramodeling

Cassandra Modeling for filter and range queries


I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.

I want to be able to write queries like these:

get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds

or

get all users who are men who are 6'0" are ages 31-37 and have black hair

How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.

EDIT:

Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?

IF I UNDERSTAND THINGS CORRECTLY If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. Best case scenario, 2/3 billion records on three nodes. In the second case, I will be putting 2/5 billion records on each node in the best case with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?


Solution

  • Just to reiterate the end of the conversation:

    "Your understanding is correct and you are correct in stating that partition keys should be more unique than that. Each partition had a maximum size of 2GB but a practical limit is lower. In practice you would want your data partitioned into far smaller chunks that the table above. Given the ad-hoc nature of your queries in your example I do not think you would be able to practically do this by data modelling alone. I would suggest looking at using a Solr index on a table. This would allow you a robust search capability. If you use Datastax you are even able to query this via CQL"

    Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.