Search code examples
postgresqlpostgresql-performance

postgresql - are my indexes or column types slowing down my query?


I have a table I created locally to use some of PG's window functions on a dataset that has about 4 million rows (originally a text file). Each row corresponds to a customer order.

CREATE TABLE orders
(
  orderid integer,
  customerid integer,
  orderdate date,
  status text,
  amount money,
  tax money,
  customername text,
  customerstate text

I have the database running locally in a i7 8gb RAM Windows 8 machine. I have btree indexes (indices?) on orderid, customerid and orderdate.

When I run the following query, it takes 300 seconds (appx). I was hoping with some basic tuning I could get it down to a minute, but I'm not a DBA. Anyone have tips?

select orderid, customername, orderdate, 
rank() OVER (PARTITION BY customername ORDER BY orderdate ASC) as cust_ord_nbr
from orders

Solution

  • Covering index

    Partition by customerid, like @Daniel commented. integer is smaller and cheaper to sort. If you don't need customername in the result, replace it with customerid completely.

    A multicolumn index can help (like @wildplasser commented). If it's a (mostly) read-only table, a "covering" index that allows index-only scans would be even faster - especially if included columns are small:

    CREATE INDEX orders_nbr_idx ON orders (customerid, orderdate, orderid);
    

    Adding orderid to the index only makes sense if you get index-only scans out of it. If you need the customername, add that, too. More:

    If it's (mostly) a read-only table, execute the expensive query once and save the snapshot as MATERIALIZED VIEW for reuse ...

    Peanuts

    You can do a couple of small things to reduce the memory footprint. After playing column tetris, this would save 0-7 bytes per row currently lost to padding:

    CREATE TABLE orders (
      orderid integer,
      customerid integer,
      amount money,
      tax money,
      orderdate date,
      status text,
      customername text,
      customerstate text
      );
    

    If you write the result to another table (or MATERIALIZED VIEW), it would save a bit to optimize the query in a similar fashion. rank() produces a bigint, by casting to int you save 8 bytes per row (4 + 4 padding):

    SELECT orderid, customername, orderdate
        -- orderid, customerid, orderdate  -- good enough?
         , rank() OVER (PARTITION BY customerid
                        ORDER BY orderdate)::int AS cust_ord_nbr
    FROM   orders;