
Query optimization possibilities?


I'm trying to optimize a rather large query that has three nested subqueries (like Russian dolls). The query itself is generated by South from a Django project, and I freely admit I'm no expert at SQL optimization. My strategy so far is to start with the innermost query and work my way outwards.


The first, innermost query is

SELECT
  DISTINCT ON (quote_id) quote_id,
  MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id, created_at
ORDER BY quote_id, created_at DESC;

The EXPLAIN ANALYZE for the above is

 Unique  (cost=535905.10..610867.38 rows=3331657 width=12) (actual time=4364.469..7587.242 rows=1462625 loops=1)
   ->  GroupAggregate  (cost=535905.10..602538.24 rows=3331657 width=12) (actual time=4364.467..6996.550 rows=3331656 loops=1)
         Group Key: quote_id, created_at
         ->  Sort  (cost=535905.10..544234.24 rows=3331657 width=12) (actual time=4364.460..5574.351 rows=3331657 loops=1)
               Sort Key: quote_id, created_at
               Sort Method: external merge  Disk: 84648kB
               ->  Seq Scan on billing_pricequotestatus  (cost=0.00..61080.57 rows=3331657 width=12) (actual time=0.013..854.722 rows=3331657 loops=1)
 Planning time: 0.107 ms
 Execution time: 7759.317 ms
(9 rows)

Table structure is

                                    Table "public.billing_pricequotestatus"
   Column   |           Type           |                               Modifiers
------------+--------------------------+-----------------------------------------------------------------------
 id         | integer                  | not null default nextval('billing_pricequotestatus_id_seq'::regclass)
 created_at | timestamp with time zone | not null
 updated_at | timestamp with time zone | not null
 notes      | text                     | not null
 name       | character varying(20)    | not null
 quote_id   | integer                  | not null
Indexes:
    "billing_pricequotestatus_pkey" PRIMARY KEY, btree (id)
    "billing_pricequotestatus_quote_id" btree (quote_id)
    "status_timestamp_idx" btree (quote_id, created_at)
Foreign-key constraints:
    "quote_id_refs_id_2b0d5331de8d31b7" FOREIGN KEY (quote_id) REFERENCES billing_pricequote(id) DEFERRABLE INITIALLY DEFERRED

I've tried http://explain.depesz.com/, but I'm not sure how to derive next steps from the report. I've also found an article suggesting that the ORDER BY clause can be removed if the SELECT returns the rows in order anyway, which I think might be the case here. I'm unsure how to tell.

If I remove the ORDER BY clause, that shaves off ~3410 ms, but I feel this should be faster (if I only do a straight SELECT with no aggregate function, DISTINCT or ordering, my baseline time appears to be 832.427 ms). I've seen several other SO posts about tables 10x the size of mine getting 3-5x better performance with the right indexes. I know it's not always an apples-to-apples comparison, but I'm hoping for some insight anyway.


Solution

  • That's confusing: you group by (quote_id, created_at), so MAX(created_at) is simply created_at within each group, and then DISTINCT ON (quote_id) with the descending sort keeps only the latest row per quote_id.

    This should return the same result:

    SELECT
      quote_id,
      MAX(created_at) AS max_created_at
    FROM billing_pricequotestatus
    GROUP BY quote_id;
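
    A minimal sketch for checking that claim on your own data, using only the table and columns from the question. The first statement lists rows the original query returns that the simplified one does not; an empty result here, and again with the two branches swapped, would suggest the two queries agree. The second statement is just the simplified query under EXPLAIN, so you can compare its plan and timing against the one posted above. I haven't run either against your table, so treat this as a starting point rather than a measured result.

    ```sql
    -- Rows returned by the original query but not by the simplified one.
    -- Expect no rows; swap the two branches to check the other direction.
    (
      SELECT
        DISTINCT ON (quote_id) quote_id,
        MAX(created_at) AS max_created_at
      FROM billing_pricequotestatus
      GROUP BY quote_id, created_at
      ORDER BY quote_id, created_at DESC
    )
    EXCEPT
    (
      SELECT
        quote_id,
        MAX(created_at) AS max_created_at
      FROM billing_pricequotestatus
      GROUP BY quote_id
    );

    -- Plan and timing for the simplified query, with buffer usage included.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT
      quote_id,
      MAX(created_at) AS max_created_at
    FROM billing_pricequotestatus
    GROUP BY quote_id;
    ```

    Note that without an ORDER BY the simplified query makes no promise about row order. If the enclosing query relies on the rows arriving sorted by quote_id, add ORDER BY quote_id explicitly rather than counting on the plan to produce it.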