Tags: database-design, time-series, mariadb, rdbms

Aggregating time-series data on multiple axes?


I have many millions of time-series points coming in each day with two important axes I need to search over. My data looks like the following:

X, Y, value, TIMESTAMP

Originally these were stored in MariaDB, but the table is growing too fast. Simple aggregate queries (like SUM()) take too long, even with indexes on a medium-sized server.

Here are some example queries:

SELECT COUNT(*) FROM tbl 
WHERE X = 23 AND Y = 46 AND TIMESTAMP > NOW() - INTERVAL 30 DAY

SELECT X, Y, COUNT(*) FROM tbl
WHERE TIMESTAMP > NOW() - INTERVAL 30 DAY
GROUP BY X, Y
ORDER BY COUNT(*) DESC

I have two indexes:

X, Y, value
X, Y, TIMESTAMP

I'm looking for recommendations on ways (or new databases) to store this data providing fast lookups for any combination of X and Y while filtering on TIMESTAMP or value.


Solution

  • Raymond Nijland recommended using a materialized view (a table built from queries on other tables). At first I dismissed it, because the query I was using to build the materialized view required (almost) a full table scan, which was exactly what I was trying to avoid.

    However, a materialized view can also be built one piece at a time, which turns out to be a great solution to this problem for both NoSQL and SQL databases (provided suitable indexes exist).

    RDBMS

    When an insert arrives for an axis pair (X, Y), fetch only the records with that X and Y and re-run the calculation on them. In my case this works well because the number of daily inserts per axis pair is very low (though the total across all axis pairs is high).

    When:

    INSERT X, Y, value, TIMESTAMP
    

    Then run:

    INSERT INTO reports (X, Y, cnt, updated_at, ...) 
    SELECT X, Y, COUNT(*), NOW(), ...(other columns)... FROM tbl 
    WHERE X = ? AND Y = ? AND TIMESTAMP BETWEEN ? AND ?
    

    This is a vague example, but assuming properly structured indexes and partition/primary keys you can maintain an always-updated materialized reports table of sorts.

    If there are axes which are not updated very often, you can run a second background task to identify and remove/update rows WHERE updated_at < NOW() - INTERVAL 1 DAY.
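The per-pair refresh described above can be sketched end to end. This is a minimal illustration that uses Python's built-in sqlite3 module in place of MariaDB so it runs anywhere; the `total` column, the helper names `insert_point` and `stale_pairs`, and the ISO-formatted text timestamps are assumptions made for the example. In MariaDB the same idea would likely use `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` with `NOW()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (x INTEGER, y INTEGER, value REAL, ts TEXT);
    CREATE INDEX idx_tbl_x_y_ts ON tbl (x, y, ts);
    CREATE TABLE reports (
        x INTEGER, y INTEGER, cnt INTEGER, total REAL, updated_at TEXT,
        PRIMARY KEY (x, y)   -- one summary row per axis pair
    );
""")

def insert_point(conn, x, y, value, ts, window_start, window_end):
    """Store one point, then recompute the report row for that (x, y) pair only."""
    with conn:  # both statements commit together
        conn.execute(
            "INSERT INTO tbl (x, y, value, ts) VALUES (?, ?, ?, ?)",
            (x, y, value, ts),
        )
        # Only rows for this pair are scanned, via the (x, y, ts) index.
        conn.execute(
            """
            INSERT OR REPLACE INTO reports (x, y, cnt, total, updated_at)
            SELECT x, y, COUNT(*), SUM(value), datetime('now')
            FROM tbl
            WHERE x = ? AND y = ? AND ts BETWEEN ? AND ?
            GROUP BY x, y
            """,
            (x, y, window_start, window_end),
        )

def stale_pairs(conn, max_age="-1 day"):
    """Pairs whose report row has not been refreshed recently (background-task candidates)."""
    return conn.execute(
        "SELECT x, y FROM reports WHERE updated_at < datetime('now', ?)",
        (max_age,),
    ).fetchall()

window = ("2020-01-01 00:00:00", "2020-01-31 23:59:59")
insert_point(conn, 23, 46, 1.5, "2020-01-05 12:00:00", *window)
insert_point(conn, 23, 46, 2.5, "2020-01-06 12:00:00", *window)
insert_point(conn, 7, 9, 4.0, "2020-01-07 12:00:00", *window)
rows = conn.execute("SELECT x, y, cnt, total FROM reports ORDER BY x").fetchall()
```

After the three inserts, `reports` holds one row per axis pair, and reading a pair's count no longer touches `tbl` at all.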

    Redis

    Atomic counters are a very useful way to keep an aggregate score for incoming metrics. After every insert, simply update a separate composite key counter for the axis you care about:

    redis> SET X#Y#2020-01-01 1
    "OK"
    redis> INCR X#Y#2020-01-01
    (integer) 2
    

    This is harder for data with multiple axes, because every combination of axes you want to query needs its own counter key.
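One way to picture the multi-axis case is to bump several composite keys per insert, one for each combination you expect to query. The sketch below uses an in-memory `Counter` as a stand-in for Redis so it runs without a server; the `*` wildcard key scheme and the names `keys_for` and `record` are assumptions for illustration, and against a real server each `+= 1` would be one `INCR` call (which creates a missing key before incrementing, so a preceding `SET` is not required).

```python
from collections import Counter

# In-memory stand-in for Redis keys; each "+= 1" mirrors one INCR.
counters = Counter()

def keys_for(x, y, day):
    """Composite keys to bump for one point: the pair, X alone, and Y alone."""
    return [
        f"{x}#{y}#{day}",  # exact (X, Y) pair
        f"{x}#*#{day}",    # all Y values for this X
        f"*#{y}#{day}",    # all X values for this Y
    ]

def record(x, y, day):
    """Increment every counter that this point contributes to."""
    for key in keys_for(x, y, day):
        counters[key] += 1

record(23, 46, "2020-01-01")
record(23, 46, "2020-01-01")
record(23, 50, "2020-01-01")
```

The cost is write amplification: each point now triggers three increments instead of one, and the number grows with every extra axis combination.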

    DynamoDB, MongoDB, etc...

    • AWS DynamoDB has "streams", which let an AWS Lambda function be notified of a change.

    • MongoDB has change streams, which you can use to react to database updates.

    In both cases, you can run a background map/reduce on the data and update a calculation from the data you scanned.
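The reduce-and-update step can be sketched independently of either database. The record shape below (plain dicts with `x`, `y`, and `value`) is hypothetical and is not the actual DynamoDB Streams or MongoDB event format; `reduce_batch` and `apply_deltas` are names invented for this example, and a real handler would write the result back to a summary table or collection rather than a dict.

```python
from collections import defaultdict

def reduce_batch(records):
    """Collapse a batch of change records into one delta per (x, y) pair."""
    deltas = defaultdict(lambda: {"cnt": 0, "total": 0.0})
    for rec in records:
        key = (rec["x"], rec["y"])
        deltas[key]["cnt"] += 1
        deltas[key]["total"] += rec["value"]
    return dict(deltas)

def apply_deltas(aggregates, deltas):
    """Fold the per-pair deltas into the stored aggregates (a dict here)."""
    for key, delta in deltas.items():
        row = aggregates.setdefault(key, {"cnt": 0, "total": 0.0})
        row["cnt"] += delta["cnt"]
        row["total"] += delta["total"]
    return aggregates

aggregates = {}
batch = [
    {"x": 23, "y": 46, "value": 1.5},
    {"x": 23, "y": 46, "value": 2.5},
    {"x": 7, "y": 9, "value": 4.0},
]
apply_deltas(aggregates, reduce_batch(batch))
apply_deltas(aggregates, reduce_batch([{"x": 7, "y": 9, "value": 1.0}]))
```

Reducing the batch first means each axis pair is written once per batch rather than once per record, which matters when writes to the summary store are the expensive part.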

    This is often a much more expensive operation than the equivalent on smaller data sets that fit in memory (Redis) or in an RDBMS (above).

    Note: I'm still looking for better solutions to time-series data with multiple axes on NoSQL platforms as my current recommendation is easier said than done.