
When working with QuestDB, are symbol columns good for performance for huge amounts of rows each?


When working with regular SQL databases, indexes are useful for fetching a few rows, but not so useful when you are fetching a large amount of data from a table. For example, imagine you have a table with stock valuations of 10 stocks over time:

|------+--------+-------|
| time | stock  | value |
|------+--------+-------|
| ...  | stock1 | ...   |
| ...  | stock2 | ...   |
| ...  | ...    | ...   |
|------+--------+-------|

As far as I can tell, indexing it by stock (even with an enum/int/foreign key) is usually not very useful in a database like Postgres if you want to get data over a large period of time. You end up with an index spanning a large part of the table, and it ends up being faster for the database to do a sequential scan, for example, to get the average value over the whole dataset for each stock:

SELECT stock, avg(value) FROM stock_values GROUP BY stock

Given that QuestDB is column oriented, I would guess that it would perform better to have a separate column for each stock.

So, what schema is recommended in QuestDB for a situation like this? One column for each stock, or would a symbol column for each stock symbol be as good (or good enough) even if there are millions of rows for each symbol?


Solution

  • A column per stock is not easy to achieve in QuestDB. If you create a table like this

    |------+--------+--------+--------|
    | time | stock1 | stock2 | stock3 |
    |------+--------+--------+--------|
    
    

    then you'll have to insert all values together in one row, or you end up with gaps:

    |------+--------+--------+--------|
    | time | stock1 | stock2 | stock3 |
    |------+--------+--------+--------|
    | t1   |    1.1 |        |        |
    | t2   |        |   3.45 |        |
    | t3   |        |        | 103.45 |
    |------+--------+--------+--------|
    

    Even when t1 == t2 == t3, doing the insert as 3 separate operations will still result in 3 rows.
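
    A minimal sketch of that failure mode (the table and column names follow the example above; the timestamps are made up):

    ```sql
    -- Hypothetical wide table: one column per stock
    CREATE TABLE stock_values_wide (
        time   TIMESTAMP,
        stock1 DOUBLE,
        stock2 DOUBLE,
        stock3 DOUBLE
    ) TIMESTAMP(time);

    -- Three separate inserts at the same timestamp...
    INSERT INTO stock_values_wide (time, stock1) VALUES ('2023-01-01T00:00:00Z', 1.1);
    INSERT INTO stock_values_wide (time, stock2) VALUES ('2023-01-01T00:00:00Z', 3.45);
    INSERT INTO stock_values_wide (time, stock3) VALUES ('2023-01-01T00:00:00Z', 103.45);
    -- ...still produce three rows, each with NULL gaps in the other columns
    ```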

    So symbols are a better choice here.

    Symbol columns can be indexed or non-indexed, and non-indexed symbols can be beneficial when the number of distinct values is low. Reading the full table vs. reading by index is a matter of index selectivity, not data range. If the selectivity is high (e.g. the distinct symbol count is, say, 10k), fetching by index is faster than range scans.
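
    As a sketch, a symbol-based schema might look like this (the names and the daily partitioning are illustrative choices, not prescribed by the answer):

    ```sql
    -- Symbol column without an index: a reasonable default
    -- when the distinct symbol count is low
    CREATE TABLE stock_values (
        time  TIMESTAMP,
        stock SYMBOL,
        value DOUBLE
    ) TIMESTAMP(time) PARTITION BY DAY;

    -- Symbol column with an index: helps when selectivity is high,
    -- i.e. many distinct symbols and queries that filter on one of them
    CREATE TABLE stock_values_indexed (
        time  TIMESTAMP,
        stock SYMBOL INDEX,
        value DOUBLE
    ) TIMESTAMP(time) PARTITION BY DAY;
    ```

    With either table, the aggregation from the question works unchanged: SELECT stock, avg(value) FROM stock_values GROUP BY stock.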