Tags: python, mongodb, influxdb, timescaledb

How to plan TimescaleDB for huge price data with a simple table schema


I am trying out TimescaleDB to store price data. The price table schema is as follows:

CREATE TABLE prices(
  time TIMESTAMPTZ NOT NULL,
  pid  VARCHAR(1024) NOT NULL,
  price DOUBLE PRECISION NOT NULL
);

We need to keep price data for at most 183 days. We currently use MongoDB for this data; it already holds 9 billion records, and both inserting new records and removing records older than 183 days have become really slow.

The query is simple: given a pid and a date range (for example, 7 days), it returns a tuple containing the average, maximum, and minimum price. Query load is around 20 requests per second.
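
For reference, the query looks roughly like this (a sketch against the schema above; 'some-pid' is a hypothetical id):

SELECT avg(price) AS avg_price,
       max(price) AS max_price,
       min(price) AS min_price
FROM prices
WHERE pid = 'some-pid'                   -- hypothetical pid
  AND time > now() - INTERVAL '7 days';  -- the 7-day example window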

In addition, we curate around 30 million price records every day and insert them into MongoDB each morning around 2 AM. The insertion takes at least 4 hours to complete.

Looking into and testing TimescaleDB for this use case, I of course enabled a hypertable, set the chunk interval to 24 hours, and created an index on pid (a sketch of this setup follows). This is what I have done so far, and it completes the insertion of 30 million records within 20 minutes using Python's psycopg2 copy_from. I would like to know what other caveats or suggestions there are for planning out TimescaleDB in an optimal way for the case I am encountering.
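
The hypertable setup described above looks roughly like this (a sketch; the retention call at the end is an assumption based on the 183-day requirement, and its exact signature depends on the TimescaleDB version):

-- Convert the plain table into a hypertable with 24-hour chunks
SELECT create_hypertable('prices', 'time', chunk_time_interval => INTERVAL '24 hours');

-- Index for per-pid lookups (see the answer below for a better composite version)
CREATE INDEX ON prices (pid);

-- Dropping whole chunks older than 183 days is far cheaper than row-level deletes;
-- on TimescaleDB 2.x this can also be automated with add_retention_policy
SELECT drop_chunks('prices', now() - INTERVAL '183 days');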

p.s. I did try InfluxDB, but it didn't perform well once series cardinality went over a million or so.


Solution

  • (Timescale co-founder)

    A few suggestions:

    1. Make your composite index on (pid, time DESC), not just pid (a sketch follows the link below).

    https://blog.timescale.com/blog/use-composite-indexes-to-speed-up-time-series-queries-sql-8ca2df6b3aaa/
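
    A minimal sketch of that index, assuming the schema from the question (where the time column is named time):

    -- pid handles the equality predicate; time DESC matches recent-range scans
    CREATE INDEX ON prices (pid, time DESC);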

    2. See if creating a continuous aggregate would help your use case (see also the note after the link below). For example, something like:
    • CREATE VIEW prices_daily WITH (timescaledb.continuous) AS SELECT pid, time_bucket('1 day', time) AS bucket, min(price) AS min_price, max(price) AS max_price FROM prices GROUP BY pid, bucket;

    https://docs.timescale.com/latest/using-timescaledb/continuous-aggregates
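
    One caveat worth noting: the question also needs the average, and a true multi-day average cannot be rebuilt from per-day averages alone. A sketch of an aggregate that also keeps a sum and a count (column names are assumptions; newer TimescaleDB versions use CREATE MATERIALIZED VIEW for this):

    CREATE VIEW prices_daily WITH (timescaledb.continuous) AS
      SELECT pid,
             time_bucket('1 day', time) AS bucket,
             min(price) AS min_price,
             max(price) AS max_price,
             sum(price) AS sum_price,  -- keep sum and count so that any window's
             count(*)   AS n           -- average is sum(sum_price) / sum(n)
      FROM prices
      GROUP BY pid, bucket;

    -- 7-day stats for one pid from the daily rollup
    SELECT min(min_price) AS min_price,
           max(max_price) AS max_price,
           sum(sum_price) / sum(n) AS avg_price
    FROM prices_daily
    WHERE pid = 'some-pid'  -- hypothetical pid
      AND bucket > now() - INTERVAL '7 days';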

    3. Explore the use of compression, both to reduce storage and to enable faster scans over long time periods. In particular, given that your queries are usually by pid, I would probably use something like (a policy sketch follows the link below):
    • ALTER TABLE prices SET (timescaledb.compress, timescaledb.compress_segmentby = 'pid');

    https://docs.timescale.com/latest/using-timescaledb/compression
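
    To automate this, a compression policy compresses chunks once they are old enough to be effectively read-only history. The function below is the TimescaleDB 2.x name (older 1.x releases call it add_compress_chunks_policy), and the 7-day threshold is just an assumption:

    -- Compress chunks whose data is older than 7 days; segmenting by pid
    -- keeps each pid's rows together, so per-pid scans stay fast
    SELECT add_compression_policy('prices', INTERVAL '7 days');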

    Also, you're welcome to join the TimescaleDB community on Slack for these types of questions: https://slack.timescale.com/