Search code examples
sqlpostgresqlpostgresql-9.3postgresql-performancesql-execution-plan

Postgresql ignoring index on timestamp column even if query is faster using index


On postgresql 9.3, I have a table with a little over a million records, the table was created as:

CREATE TABLE entradas
(
 id serial NOT NULL,
 uname text,
 contenido text,
 fecha date,
 hora time without time zone,
 fecha_hora timestamp with time zone,
 geom geometry(Point,4326),
 CONSTRAINT entradas_pkey PRIMARY KEY (id)
)
WITH (
 OIDS=FALSE
);
ALTER TABLE entradas
OWNER TO postgres;

CREATE INDEX entradas_date_idx
 ON entradas
 USING btree
 (fecha_hora);

CREATE INDEX entradas_gix
 ON entradas
 USING gist
 (geom);

I'm executing a query to aggregate rows on time intervals as follows:

WITH x AS (
        SELECT t1, t1 + interval '15min' AS t2
        FROM   generate_series('2014-12-02 0:0' ::timestamp
                  ,'2014-12-02 23:45' ::timestamp, '15min') AS t1
        )

    select distinct
        x.t1,
        count(t.id) over w
    from x
    left join entradas  t  on t.fecha_hora >= x.t1
            AND t.fecha_hora < x.t2
    window w as (partition by x.t1)
    order by x.t1

This query takes about 50 seconds. From the output of explain, you can see that the timestamp index is not used:

Unique  (cost=86569161.81..87553155.15 rows=131199111 width=12)
 CTE x
   ->  Function Scan on generate_series t1  (cost=0.00..12.50 rows=1000 width=8)
   ->  Sort  (cost=86569149.31..86897147.09 rows=131199111 width=12)
     Sort Key: x.t1, (count(t.id) OVER (?))
     ->  WindowAgg  (cost=55371945.38..57667929.83 rows=131199111 width=12)
           ->  Sort  (cost=55371945.38..55699943.16 rows=131199111 width=12)
                 Sort Key: x.t1
                 ->  Nested Loop Left Join  (cost=0.00..26470725.90 rows=131199111 width=12)
                       Join Filter: ((t.fecha_hora >= x.t1) AND (t.fecha_hora < x.t2))
                       ->  CTE Scan on x  (cost=0.00..20.00 rows=1000 width=16)
                       ->  Materialize  (cost=0.00..49563.88 rows=1180792 width=12)
                             ->  Seq Scan on entradas t  (cost=0.00..37893.92 rows=1180792 width=12)

However, if i do set enable_seqscan=false (I know, one should never do this), then the query executes in less than a second and the output of explain shows that it is using the index on the timestamp column:

Unique  (cost=91449584.16..92433577.50 rows=131199111 width=12)
CTE x
  ->  Function Scan on generate_series t1  (cost=0.00..12.50 rows=1000 width=8)
->  Sort  (cost=91449571.66..91777569.44 rows=131199111 width=12)
      Sort Key: x.t1, (count(t.id) OVER (?))
      ->  WindowAgg  (cost=60252367.73..62548352.18 rows=131199111 width=12)
            ->  Sort  (cost=60252367.73..60580365.51 rows=131199111 width=12)
                  Sort Key: x.t1
                  ->  Nested Loop Left Join  (cost=1985.15..31351148.25 rows=131199111 width=12)
                       ->  CTE Scan on x  (cost=0.00..20.00 rows=1000 width=16)
                        ->  Bitmap Heap Scan on entradas t  (cost=1985.15..30039.14 rows=131199 width=12)
                              Recheck Cond: ((fecha_hora >= x.t1) AND (fecha_hora < x.t2))
                              ->  Bitmap Index Scan on entradas_date_idx  (cost=0.00..1952.35 rows=131199 width=0)
                                   Index Cond: ((fecha_hora >= x.t1) AND (fecha_hora < x.t2))

Why is postgres not using entradas_date_idx unless I force it to even if executing the query is way faster using it?

How could I make postgres use entradas_date_idx without resorting to set enable_seqscan=false?


Solution

  • You can simplify your query quite a bit:

    SELECT x.t1, count(*) AS ct
    FROM   generate_series('2014-12-02'::timestamp
                         , '2014-12-03'::timestamp
                         , '15 min'::interval) x(t1)
    LEFT   JOIN entradas t ON t.fecha_hora >= x.t1
                          AND t.fecha_hora <  x.t1 + interval '15 min' 
    GROUP  BY 1
    ORDER  BY 1;
    

    DISTINCT in combination with a window function is typically much more expensive (and also harder to estimate) for the query planner.

    The CTE is not necessary and typically more expensive than a subquery. And also harder to estimate for the query planner since CTEs are optimization barriers.

    It looks like you want to cover a whole day, but you were missing out on the last 15 minutes. Use a simpler generate_series() expression to cover the whole day (still not overlapping with adjacent days).

    Next, why do you have fecha_hora timestamp with time zone, while you also have have fecha date and hora time [without time zone]? Looks like it should be fecha_hora timestamp and drop the redundant columns?
    This would also avoid the subtle difference to the data type of your generate_series() expression - which should not normally be a problem, but timestamp depends on the time zone of your session and is not IMMUTABLE like timestamptz.

    If that's sill not good enough, add a redundant WHERE condition as advised by @Daniel to instruct the query planner.

    Basic advise for bad plans is applicable as well: