Tags: cassandra, time-series, cassandra-2.0, finance, cassandra-3.0

Efficient Cassandra DB design to retrieve a summary of time series financial data


I am looking to use the Apache Cassandra database to store a time series of 1-minute OHLCV financial data for ~1000 symbols. It will need to be updated in real time as data is streamed in. Any entry older than 24 hours is no longer needed and should be discarded.

Assuming there are 1000 symbols with entries for each minute from the past 24 hrs, the total number of entries will amount to 1000*(60*24) = 1,440,000.

I am interested in designing this database to efficiently retrieve a slice of all symbols from the past [30m, 1h, 12h, 24h] with fast query times. Ultimately, I need to retrieve the OHLCV that summarises this slice. The resulting output would be {symbol, FIRST(open), MAX(high), MIN(low), LAST(close), SUM(volume)} of the slice for each symbol. This essentially summarises the 1m OHLCV entries and creates a [30m, 1h, 12h, 24h] OHLCV relative to the time of the query. For example, if I want to retrieve the past 1h OHLCV at 1:32pm, the query should give me a 1h OHLCV that represents data from 12:32pm to 1:32pm.

What would be a good design to meet these requirements? I am not concerned with the database's footprint on disk. The real concern is fast query times that are light on CPU and RAM.

I have come up with a simple and naive way to store each record with clustering ordered by time:

CREATE TABLE symbols (
 symbol text,       -- partition key: one partition per symbol
 time timestamp,    -- clustering column: rows ordered by time within a partition
 open double,
 high double,
 low double,
 close double,
 volume double,
 PRIMARY KEY ((symbol), time)
) WITH CLUSTERING ORDER BY (time DESC);

But I am not sure how to select from this to meet my requirements. I would rather design it specifically for my query, and duplicate data if necessary.
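
The closest I have come so far is something like the following (just an untested sketch on my part: MIN/MAX/SUM need the native aggregates added in Cassandra 2.2, FIRST/LAST are emulated with LIMIT 1 in each clustering direction, and the 24-hour default_time_to_live is my guess at handling the expiry requirement):

-- Untested sketch: expire rows automatically once they are 24 hours old.
ALTER TABLE symbols WITH default_time_to_live = 86400;

-- High, low and volume of the past hour for one symbol (Cassandra 2.2+ aggregates):
SELECT MAX(high) AS high, MIN(low) AS low, SUM(volume) AS volume
FROM symbols
WHERE symbol = 'IBM' AND time >= '2018-03-29 12:32:00+0000';

-- Close = newest row in the slice (the partition is stored newest-first):
SELECT close FROM symbols
WHERE symbol = 'IBM' AND time >= '2018-03-29 12:32:00+0000'
LIMIT 1;

-- Open = oldest row in the slice (reverse the clustering order):
SELECT open FROM symbols
WHERE symbol = 'IBM' AND time >= '2018-03-29 12:32:00+0000'
ORDER BY time ASC
LIMIT 1;

That is still three round trips per symbol, i.e. thousands of queries per summary across ~1000 symbols, which is why I suspect a table designed around the query itself would work better.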

Any suggestions will be much appreciated.


Solution

  • While not based on Cassandra, Axibase Time Series Database can be quite relevant to this particular use case. It supports SQL with time-series syntax extensions to aggregate data into periods of arbitrary length.

    An OHLCV query that rolls the past hour of 1-minute data up into 15-minute periods might look as follows:

    SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time, 
      FIRST(t_open.value) AS open, 
      MAX(t_high.value) AS high, 
      MIN(t_low.value) AS low, 
      LAST(t_close.value) AS close, 
      SUM(t_volume.value) AS volume 
    FROM stock.open AS t_open
      JOIN stock.high AS t_high
      JOIN stock.low AS t_low
      JOIN stock.close AS t_close  
      JOIN stock.volume AS t_volume  
    WHERE t_open.entity = 'ibm'
      AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
    GROUP BY PERIOD(15 MINUTE, END_TIME)
      ORDER BY datetime
    

    Note the GROUP BY PERIOD clause above, which does all the aggregation work behind the scenes.

    Query results:

    | time                 | open     | high    | low      | close   | volume | 
    |----------------------|----------|---------|----------|---------|--------| 
    | 2018-03-29 10:32:00  | 151.8    | 152.14  | 151.65   | 152.14  | 85188  | 
    | 2018-03-29 10:47:00  | 152.18   | 152.64  | 152      | 152.64  | 88065  | 
    | 2018-03-29 11:02:00  | 152.641  | 153.04  | 152.641  | 152.69  | 126511 | 
    | 2018-03-29 11:17:00  | 152.68   | 152.75  | 152.43   | 152.51  | 104068 | 
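
    To return a single 1-hour OHLCV row for a sliding window like the 12:32pm-1:32pm example in the question, the same statement should work with the period length set to the length of the selection interval (a sketch with only the GROUP BY changed; with END_TIME alignment the single period ends exactly at the upper bound of the interval, as in the 15-minute output above):

    SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time, 
      FIRST(t_open.value) AS open, 
      MAX(t_high.value) AS high, 
      MIN(t_low.value) AS low, 
      LAST(t_close.value) AS close, 
      SUM(t_volume.value) AS volume 
    FROM stock.open AS t_open
      JOIN stock.high AS t_high
      JOIN stock.low AS t_low
      JOIN stock.close AS t_close
      JOIN stock.volume AS t_volume
    WHERE t_open.entity = 'ibm'
      AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
    GROUP BY PERIOD(1 HOUR, END_TIME)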
    

    You can use a Type 4 JDBC driver, API clients or just curl to run these queries.

    I'm using sample 1-minute data for the above example, which you can download from Kibot as described in these compression tests.

    Also, ATSD supports scheduled queries to materialize minutely data into OHLCV bars of longer duration, say for long-term retention.

    Disclaimer: I work for Axibase.