Tags: sql, postgresql, query-optimization

Improving query performance in a write-intensive table


We insert 30K rows/s into the following table:

CREATE UNLOGGED TABLE some_data
(
    ts           TIMESTAMP NOT NULL DEFAULT NOW(),
    a_column     VARCHAR,
    b_column     VARCHAR,
    c_column     BOOLEAN
) PARTITION BY RANGE (ts);

CREATE INDEX ON some_data (ts);
CREATE INDEX ON some_data (a_column);
CREATE INDEX ON some_data (b_column);
CREATE INDEX ON some_data (c_column);

The writers create a new partition whenever an insert fails because no partition exists for the row's timestamp. Each partition covers 15 minutes. The data is ephemeral and is deleted periodically. We partition because dropping a partition guarantees the data is physically removed; at our insert rate, vacuuming never gets a chance to run.
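
For illustration, creating one 15-minute partition and dropping an expired one could look like this (the partition names and range bounds are assumptions, modeled on the partition name that appears in the plan below):

-- Assumed DDL for a single 15-minute partition.
CREATE TABLE some_data_2024_02_13_14_15 PARTITION OF some_data
    FOR VALUES FROM ('2024-02-13 14:15:00') TO ('2024-02-13 14:30:00');

-- Periodic cleanup: dropping a whole partition removes its data
-- files immediately, with no VACUUM involved.
DROP TABLE some_data_2024_02_13_14_00;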

Furthermore, it's not important that all the data is written. Our database is mounted on a tmpfs partition, so it is in RAM. We're happy with the insert performance.

We have another process that continuously polls the table above to find the top b_column values from the past 30 seconds, using this query:

SELECT b_column, count(*) AS c
  FROM some_data
 WHERE ts >= NOW() - INTERVAL '30 SECONDS'
   AND a_column = 'a_value1'
   AND c_column IS NOT TRUE
   AND b_column != 'b_value4' AND b_column != ''
   AND b_column != 'b_value1' AND b_column != 'b_value2'
   AND b_column != 'b_value3'
 GROUP BY b_column
HAVING count(*) > 15000
 ORDER BY c DESC
 LIMIT 100;

The values in the WHERE clause of this query are always the same.

Because of the amount of data we are querying, the query takes 1.5 to 2 seconds to complete. What can we do, in terms of restructuring the table or the query, to improve the performance?

Here is the EXPLAIN (ANALYZE) output:

 Limit  (cost=26488.73..26488.90 rows=67 width=30) (actual time=1210.177..1216.418 rows=4 loops=1)
   Output: some_data.b_column, (count(*))
   ->  Sort  (cost=26488.73..26488.90 rows=67 width=30) (actual time=1210.175..1216.415 rows=4 loops=1)
         Output: some_data.b_column, (count(*))
         Sort Key: (count(*)) DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Finalize GroupAggregate  (cost=26435.53..26486.70 rows=67 width=30) (actual time=1040.285..1216.384 rows=4 loops=1)
               Output: some_data.b_column, count(*)
               Group Key: some_data.b_column
               Filter: (count(*) > 15000)
               Rows Removed by Filter: 180403
               ->  Gather Merge  (cost=26435.53..26482.20 rows=400 width=30) (actual time=1003.876..1136.791 rows=236999 loops=1)
                     Output: some_data.b_column, (PARTIAL count(*))
                     Workers Planned: 2
                     Workers Launched: 1
                     ->  Sort  (cost=25435.51..25436.01 rows=200 width=30) (actual time=977.874..1037.332 rows=118500 loops=2)
                           Output: some_data.b_column, (PARTIAL count(*))
                           Sort Key: some_data.b_column
                           Sort Method: external merge  Disk: 5144kB
                           Worker 0:  actual time=994.460..1056.444 rows=115752 loops=1
                             Sort Method: external merge  Disk: 4912kB
                           ->  Partial HashAggregate  (cost=25425.86..25427.86 rows=200 width=30) (actual time=641.936..727.838 rows=118500 loops=2)
                                 Output: some_data.b_column, PARTIAL count(*)
                                 Group Key: some_data.b_column
                                 Batches: 5  Memory Usage: 8257kB  Disk Usage: 7184kB
                                 Worker 0:  actual time=637.586..726.780 rows=115752 loops=1
                                   Batches: 5  Memory Usage: 8257kB  Disk Usage: 7064kB
                                 ->  Parallel Index Scan using some_data_2024_02_13_14_15_ts_idx on my_db.some_data_2024_02_13_14_15 some_data  (cost=0.43..24392.95 rows=206582 width=22) (actual time=0.074..463.918 rows=382730 loops=2)
                                       Output: some_data.b_column
                                       Index Cond: (some_data.ts >= (now() - '00:00:30'::interval))
                                       Filter: ((some_data.c_column IS NOT TRUE) AND ((some_data.b_column)::text <> 'b_value4'::text) AND ((some_data.b_column)::text <> ''::text) AND ((some_data.b_column)::text <> 'b_value1'::text) AND ((some_data.b_column)::text <> 'b_value2'::text) AND ((some_data.b_column)::text <> 'b_value3'::text) AND (some_data.a_column = 'a_value1'::a_column))
                                       Rows Removed by Filter: 8091
                                       Worker 0:  actual time=0.097..466.999 rows=369674 loops=1
 Planning Time: 1.563 ms
 Execution Time: 1219.166 ms

Solution

  • Your query needs to read over 200,000 rows from the table, so it will never be very fast. By increasing work_mem, you can save some time by avoiding temporary file I/O (Sort Method: external merge Disk: 5144kB); see the sketch below.

    There isn't really anything you can do about that if you need to query current data, except having enough RAM that all the data are cached.
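
    A minimal sketch of the work_mem change, applied per session (the
    64MB figure is an assumption, not a measured value):

    -- Raise the per-operation memory budget for this session only.
    -- 64MB is an assumed starting point; increase it until
    -- EXPLAIN (ANALYZE) no longer reports "external merge" sorts or
    -- aggregate disk usage, then run the polling query as before.
    SET work_mem = '64MB';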