Search code examples
databasenosqlcassandrasparse-matrix

Can Cassandra partition tables based on date/timestamp?


I have a very big table with many columns. Values in some of the columns change rarely, and as its bad to store all of this data in a single table, I would like to partition the table into many tables based on timestamp. That is, for one timestamp, one table is made, while querying all these tables should give the abstraction of one single table. That is, the query should be only executed on the required tables (based on the time range query) and all these results should be merged.

Thus I need two functionalities:

  • Automatic sparse implementation
  • Storage as multiple table based on timestamp and abstraction of hitting a single table

Which tool is best suited for this purpose? Would Cassandra be suitable?


Solution

  • Conventional SQL databases such as PostgreSQL can handle several TB (maximum theoretical table size is 32TB). Some can handle much larger volumes of data, though this generally requires partioning the data around a cluster of machines.

    10 columns is not very many - PostgreSQL has a maximum of 250-16600 columns per table depending on the column type. Indexing on time is provided, so there should be no need to partition by timestamp, given that you still need to query the data (i.e. you are not archiving old data).

    Cassandra can handle much larger volumes of data than this, but typically one would use multiple nodes in a cluster to share the load and provide replication. The typical advice seems to be to use one node per TB if the system is heavily read/write loaded, or more (2-3 TB ?) if it is lightly loaded.

    Cassandra doesn't use tables as such. It has column families, which contain rows of sparse columns (up to 2 billion per row). Again, partitioning of data should not be required, in general - you can store a vast number of rows in a single column family (under the hood, they are partitioned across your nodes, and further partitioned into files called SSTables).

    Whether Cassandra is suitable depends somewhat on the types of queries you want to make. Cassandra does not providie flexible SQL queries, so you need to structure the data to suit the queries.