I'm learning Cassandra, and as a practice data set, I'm grabbing historical stock data from Yahoo. There is going to be one record for each trading day.
Obviously, I need to make the stock symbol part of the partition key. I'm seeing conflicting information on whether I should also make the date part of the partition key, or make it a clustering column.
Realistically, the stock market is open ~253 days per year, so a single stock will have ~253 records per year. I'm not building a full-scale database, but I would like to design it correctly.
If I make the date part of the partition key, won't that possibly be spread across nodes? Won't that make a date range query slow?
If I make the date part of the partition key, won't that possibly be spread across nodes? Won't that make a date range query slow?
Yes, correct on both counts. That modeling approach is called "time bucketing," and its primary use case is for time/event data that grows over time. The good news is that you wouldn't need to do that unless your partitions were projected to get big. With your current projection of 253 rows written per partition per year, each partition will only grow by < 40 KB per year (see the calculation with nodetool tablehistograms below).
For your purposes, I think partitioning by symbol and clustering by day should suffice.
CREATE TABLE stockquotes (
symbol text,
day date,
price decimal,
PRIMARY KEY(symbol, day))
WITH CLUSTERING ORDER BY (day DESC);
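Each trading day is then a single-row write; for example:
INSERT INTO stockquotes (symbol, day, price)
VALUES ('AAPL', '2020-08-07', 444.45);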
With most time-based use cases, we tend to care about recent data more (which may or may not be true in your case). If so, then writing the data in descending order by day will improve the performance of those queries.
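For instance, a "latest N days" query can then skip the date filter entirely and rely on the clustering order:
SELECT * FROM stockquotes
WHERE symbol='AAPL'
LIMIT 5;
With CLUSTERING ORDER BY (day DESC), this returns the five most recent trading days stored for that symbol.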
Then (after writing some data), date range queries like this will work:
SELECT * FROM stockquotes
WHERE symbol='AAPL'
AND day >= '2020-08-01' AND day < '2020-08-08';
symbol | day | price
--------+------------+--------
AAPL | 2020-08-07 | 444.45
AAPL | 2020-08-06 | 455.61
AAPL | 2020-08-05 | 440.25
AAPL | 2020-08-04 | 438.66
AAPL | 2020-08-03 | 435.75
(5 rows)
To verify the partition sizes, you can use nodetool tablehistograms (once the data has been flushed to disk).
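If the data is still only in memtables, you can force a flush first so the histograms have SSTable data to report on:
bin/nodetool flush stackoverflow stockquotes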
bin/nodetool tablehistograms stackoverflow.stockquotes
stackoverflow/stockquotes histograms
Percentile  Read Latency  Write Latency  SSTables  Partition Size  Cell Count
                (micros)       (micros)                    (bytes)
50%                 0.00           0.00      0.00              124           5
75%                 0.00           0.00      0.00              124           5
95%                 0.00           0.00      0.00              124           5
98%                 0.00           0.00      0.00              124           5
99%                 0.00           0.00      0.00              124           5
Min                 0.00           0.00      0.00              104           5
Max                 0.00           0.00      0.00              124           5
Partition size each year = 124 bytes x 253 = 31,372 bytes ≈ 31 KB.
Given the tiny partition size, this model would probably be good for at least 30 years of data before any slow-down (I recommend keeping partitions <= 1 MB); 31 KB x 30 years ≈ 930 KB, still under that limit. Perhaps bucketing on something like quarter-century might suffice? Regardless, in the short term, it'll be fine.
Edit:
Seems like any date portion used in the PK would spread the data across nodes, no?
Yes, a date portion used in the partition key would spread the data across nodes. That's actually the point of doing it. You don't want to end up with the anti-pattern of unbound row growth, because the partitions will eventually get so large that they'll be unusable. This idea is all about ensuring adequate data distribution.
lets say 1/sec and I need to query across years, etc. How would that bucketing work?
So the trick with time bucketing is to find a "happy medium" between data distribution and query flexibility. Unfortunately, there will likely be edge cases where queries hit more than one partition (node). But the idea is to build a model that handles most of them well.
The example question here, of one write per second queried over a year, is a bit extreme, but the idea for solving it is the same. There are 86,400 seconds in a day. Depending on row size, that may even be too much to bucket by day, but for the sake of argument, say we can. If we bucket on day, the PK looks like this:
PRIMARY KEY ((symbol,day),timestamp)
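As a sketch (the table name stockquotes_by_day and the extra timestamp/price columns are just for illustration), the full definition might look like:
CREATE TABLE stockquotes_by_day (
symbol text,
day date,
timestamp timestamp,
price decimal,
PRIMARY KEY ((symbol, day), timestamp))
WITH CLUSTERING ORDER BY (timestamp DESC);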
And the WHERE clause starts to look like this:
WHERE symbol='AAPL' AND day IN ('2020-08-06','2020-08-07');
On the flip side, querying a few days is fine, but querying an entire year would be cumbersome. Additionally, we wouldn't want to build an IN clause of 253 days; in fact, I don't recommend folks exceed single digits in an IN clause.
A possible approach here would be to fire 253 asynchronous queries (one for each day) from the application, and then assemble and sort the result set there. Using Spark (to do everything in an RDD) is a good option here, too. In reality, Cassandra isn't a great DB for a reporting API, so there is value in exploring some additional tools.
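As a rough sketch of that application-side approach, using the Python cassandra-driver (the keyspace, the stockquotes_by_day table from above, and the list of trading days are assumptions for illustration):

from datetime import date
from cassandra.cluster import Cluster

# Assumed connection details and keyspace; adjust for your environment.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('stackoverflow')

# One day-bucketed partition per (symbol, day), as in the PK above.
query = session.prepare(
    "SELECT symbol, day, timestamp, price FROM stockquotes_by_day "
    "WHERE symbol = ? AND day = ?")

# In reality this would be the ~253 trading days of the year being reported.
trading_days = [date(2020, 8, 6), date(2020, 8, 7)]

# Fire one asynchronous query per day partition...
futures = [session.execute_async(query, ('AAPL', day)) for day in trading_days]

# ...then collect and sort the results client-side.
rows = [row for future in futures for row in future.result()]
rows.sort(key=lambda row: row.timestamp)
cluster.shutdown()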