We insert 30K rows/s into the following table:
CREATE UNLOGGED TABLE some_data
(
    ts       TIMESTAMP NOT NULL DEFAULT NOW(),
    a_column VARCHAR,
    b_column VARCHAR,
    c_column BOOLEAN
) PARTITION BY RANGE (ts);
CREATE INDEX ON some_data (ts);
CREATE INDEX ON some_data (a_column);
CREATE INDEX ON some_data (b_column);
CREATE INDEX ON some_data (c_column);
Each writer creates the next partition when it fails to insert a row. The partitions cover 15 minutes each, and the data is ephemeral: old partitions are dropped periodically. We partition because dropping a partition guarantees the data is physically removed, and at our insert rate vacuuming never gets to run.
Furthermore, it is not important that every row gets written. The database is mounted on a tmpfs partition, so it lives entirely in RAM. We are happy with the insert performance.
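For illustration, a minimal sketch of this partition lifecycle might look like the following. The boundaries and the dropped partition name are made up for this sketch; only some_data_2024_02_13_14_15 appears in the plan further down, and in reality the writers issue this DDL programmatically:

-- Sketch only: create the 15-minute partition covering 14:15-14:30,
-- the kind of DDL a writer runs when an insert fails for lack of a partition.
CREATE UNLOGGED TABLE IF NOT EXISTS some_data_2024_02_13_14_15
    PARTITION OF some_data
    FOR VALUES FROM ('2024-02-13 14:15:00') TO ('2024-02-13 14:30:00');

-- Periodic cleanup: dropping an old partition (name assumed here) removes
-- its data and indexes immediately, with no VACUUM involved.
DROP TABLE IF EXISTS some_data_2024_02_13_13_45;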
We have another process that continuously polls the above table to find the top b_column values in the past 30 seconds, using this query:
SELECT b_column, count(*) as c FROM some_data
WHERE ts >= NOW() - INTERVAL '30 SECONDS'
AND a_column = 'a_value1'
AND c_column IS NOT True
AND b_column != 'b_value4' AND b_column != ''
AND b_column != 'b_value1' AND b_column != 'b_value2'
AND b_column != 'b_value3'
GROUP BY b_column
HAVING count(*) > 15000
ORDER BY c DESC
LIMIT 100;
The values used in the WHERE clause of this query are always the same.
Because of the amount of data we are querying, the query takes 1.5 to 2 seconds to complete. What can we do, in terms of restructuring the table or the query, to improve the performance?
Here is the EXPLAIN (ANALYZE, VERBOSE) output:
Limit (cost=26488.73..26488.90 rows=67 width=30) (actual time=1210.177..1216.418 rows=4 loops=1)
  Output: some_data.b_column, (count(*))
  ->  Sort (cost=26488.73..26488.90 rows=67 width=30) (actual time=1210.175..1216.415 rows=4 loops=1)
        Output: some_data.b_column, (count(*))
        Sort Key: (count(*)) DESC
        Sort Method: quicksort Memory: 25kB
        ->  Finalize GroupAggregate (cost=26435.53..26486.70 rows=67 width=30) (actual time=1040.285..1216.384 rows=4 loops=1)
              Output: some_data.b_column, count(*)
              Group Key: some_data.b_column
              Filter: (count(*) > 15000)
              Rows Removed by Filter: 180403
              ->  Gather Merge (cost=26435.53..26482.20 rows=400 width=30) (actual time=1003.876..1136.791 rows=236999 loops=1)
                    Output: some_data.b_column, (PARTIAL count(*))
                    Workers Planned: 2
                    Workers Launched: 1
                    ->  Sort (cost=25435.51..25436.01 rows=200 width=30) (actual time=977.874..1037.332 rows=118500 loops=2)
                          Output: some_data.b_column, (PARTIAL count(*))
                          Sort Key: some_data.b_column
                          Sort Method: external merge Disk: 5144kB
                          Worker 0: actual time=994.460..1056.444 rows=115752 loops=1
                            Sort Method: external merge Disk: 4912kB
                          ->  Partial HashAggregate (cost=25425.86..25427.86 rows=200 width=30) (actual time=641.936..727.838 rows=118500 loops=2)
                                Output: some_data.b_column, PARTIAL count(*)
                                Group Key: some_data.b_column
                                Batches: 5 Memory Usage: 8257kB Disk Usage: 7184kB
                                Worker 0: actual time=637.586..726.780 rows=115752 loops=1
                                  Batches: 5 Memory Usage: 8257kB Disk Usage: 7064kB
                                ->  Parallel Index Scan using some_data_2024_02_13_14_15_ts_idx on my_db.some_data_2024_02_13_14_15 some_data (cost=0.43..24392.95 rows=206582 width=22) (actual time=0.074..463.918 rows=382730 loops=2)
                                      Output: some_data.b_column
                                      Index Cond: (some_data.ts >= (now() - '00:00:30'::interval))
                                      Filter: ((some_data.c_column IS NOT TRUE) AND ((some_data.b_column)::text <> 'b_value4'::text) AND ((some_data.b_column)::text <> ''::text) AND ((some_data.b_column)::text <> 'b_value1'::text) AND ((some_data.b_column)::text <> 'b_value2'::text) AND ((some_data.b_column)::text <> 'b_value3'::text) AND (some_data.a_column = 'a_value1'::a_column))
                                      Rows Removed by Filter: 8091
                                      Worker 0: actual time=0.097..466.999 rows=369674 loops=1
Planning Time: 1.563 ms
Execution Time: 1219.166 ms
Your query has to read several hundred thousand rows from the table (the parallel index scan alone returns over 700,000 rows across its two processes), so it will never be very fast. By increasing work_mem, you can save some time by avoiding temporary file I/O (Sort Method: external merge Disk: 5144kB).
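For example (a sketch, not a tuned value: 64MB is an assumption chosen to sit well above the roughly 5-15 MB of per-process spill the plan reports, and polling_user is a placeholder role name), work_mem can be raised just for the role or session that runs this query:

-- Assumed value and placeholder role name; adjust to your environment.
ALTER ROLE polling_user SET work_mem = '64MB';

-- Or only for the current session, e.g. to verify the effect:
SET work_mem = '64MB';
-- Re-running EXPLAIN (ANALYZE, VERBOSE) should then show in-memory sorts
-- ("quicksort") and no "Disk Usage" lines for the partial aggregate.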
There isn't really anything you can do about that row count if you need to query current data, except having enough RAM that all of the data stays cached.