
QuestDB: optimizing for out-of-order data


Would it be possible to configure O3 for the following example script? It generates random data and sends it every 5 seconds. In the first case, the timestamps are chosen randomly over the last 4 seconds, and disk usage is very low. In the second case, by changing the 4 to a 6 in the `pd.date_range` call, the timestamps are spread randomly over the last 6 seconds, so consecutive batches overlap. In this case, disk usage is high (~1 MB/s). Even a small overlap causes the problem.

How can this be optimized to avoid such high disk usage? Could I configure the O3 buffer to be longer? The O3 parameters are currently the defaults in QuestDB 8.2.1.

import pandas as pd
import numpy as np
import datetime
import time
import random

from questdb.ingress import Sender, TimestampNanos

def send(n):
    symbols = ['ETH-USD', 'BTC-USD']
    sides = ['sell', 'buy']
    price = 1 + 9999 * np.random.rand(n)
    amount = np.random.rand(n)

    isymbols = np.random.randint(0, len(symbols), size=n)
    isides = np.random.randint(0, len(sides), size=n)
    symbol = np.array(symbols)[isymbols]
    side = np.array(sides)[isides]

    now = pd.Timestamp.now()

    # create a list of timestamps spread over the last 4 (or 6) seconds
    date_range = list(pd.date_range(end=now, periods=n, freq=str(4/n) + 's'))

    random.shuffle(date_range)

    df = pd.DataFrame({
        'symbol': pd.Categorical(symbol),
        'side': pd.Categorical(side),
        'price': price,
        'amount': amount,
        'timestamp': date_range})

    conf = 'http::addr=localhost:9000;'
    with Sender.from_conf(conf) as sender:
        for i in range(len(df)):
            line = df.iloc[i]

            sender.row('test_table',
                        symbols={'symbol': line['symbol'], 'side': line['side']},
                        columns={'price': float(line['price']), 'amount': float(line['amount'])},
                        at=line['timestamp'])
        sender.flush()
    
total_time = 10 * 60  # 10 minutes

# Start time
t0 = time.time()

while time.time() - t0 < total_time:
    send(5000)
    time.sleep(5)
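As a side note on the script itself: shuffling within a batch is the worst case for the writer. If each batch is sorted by its designated timestamp before sending, only the overlap *between* batches triggers O3 work. A minimal sketch of that idea (the `make_batch` helper is hypothetical and mirrors the script above, without the network calls):

```python
import pandas as pd
import numpy as np

def make_batch(n, span_seconds):
    # Hypothetical helper mirroring the script above: build one batch of rows
    # whose designated timestamps cover the last `span_seconds` seconds.
    now = pd.Timestamp.now()
    ts = pd.date_range(end=now, periods=n,
                       freq=pd.Timedelta(seconds=span_seconds / n))
    df = pd.DataFrame({
        'price': 1 + 9999 * np.random.rand(n),
        'timestamp': ts,
    })
    # Sorting by the designated timestamp means rows within a batch arrive
    # in order; O3 is then caused only by overlap across batches.
    return df.sort_values('timestamp').reset_index(drop=True)

batch = make_batch(5000, 4)
assert batch['timestamp'].is_monotonic_increasing
```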

Solution

  • Note that some of the O3 (out-of-order) parameters in the QuestDB config apply to non-WAL tables only.

    In general, you could try increasing cairo.o3.lag.calculation.windows.size and cairo.o3.max.lag.
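    For example, these could go in `server.conf` (the values below are illustrative only, not recommendations; check the QuestDB configuration reference for the exact units and defaults):

    ```ini
    # Illustrative values; tune for your workload.
    cairo.o3.max.lag=600000
    cairo.o3.lag.calculation.windows.size=8
    ```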

    You could also try reducing the partition size, i.e. if you have daily partitions, try hourly ones. That way, rewriting would happen over smaller files and should be faster.
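    Partitioning is fixed at table creation, so switching from daily to hourly partitions means creating a new table. A sketch of the DDL, assuming the column layout of the script's `test_table` (the table name here is hypothetical):

    ```sql
    -- Same columns as the script's test_table, partitioned by hour.
    CREATE TABLE test_table_hourly (
        symbol SYMBOL,
        side SYMBOL,
        price DOUBLE,
        amount DOUBLE,
        timestamp TIMESTAMP
    ) TIMESTAMP(timestamp) PARTITION BY HOUR WAL;
    ```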

    The O3 process is copy-on-write, so some write amplification is expected. To avoid it entirely, you can also split your time-series data into separate tables, so that values for different symbols live in different tables and have fewer chances of arriving out of order. If you try this route, you should reduce cairo.writer.data.append.page.size.
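    The split-by-symbol idea amounts to routing each row to a per-symbol table name instead of the fixed `'test_table'` in the script's `sender.row()` call. A tiny sketch (the naming scheme is hypothetical):

    ```python
    # Route rows to one table per symbol, e.g. for use as the first
    # argument of sender.row() instead of a fixed 'test_table'.
    def table_for(symbol):
        # 'ETH-USD' -> 'trades_eth_usd'
        return 'trades_' + symbol.lower().replace('-', '_')

    assert table_for('ETH-USD') == 'trades_eth_usd'
    assert table_for('BTC-USD') == 'trades_btc_usd'
    ```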

    And just the obvious one: a fast disk makes a huge difference. If you are using a cloud provider with block storage, try maximizing the IOPS/throughput specs.