Search code examples
postgresqlsql-execution-planexplain

Count rows within a group, but also from global result set: performance issue


I have a table with log records. Each log record is represented by a status (open or closed) and a date:

CREATE TABLE logs (
  id          BIGSERIAL PRIMARY KEY,
  status      VARCHAR NOT NULL,
  inserted_at DATE NOT NULL
);

I need to get a daily report with a following information:

  1. how many log records with status = open were created,
  2. how many log records with status = closed were created,
  3. how many log records with status = open exist to this day, including this day.

Here's a sample report output:

    day     | created | closed | total_open
------------+---------+--------+------------
 2017-01-01 |       2 |      0 |          2
 2017-01-02 |       2 |      1 |          3
 2017-01-03 |       1 |      1 |          3
 2017-01-04 |       1 |      0 |          4
 2017-01-05 |       1 |      0 |          5
 2017-01-06 |       1 |      0 |          6
 2017-01-07 |       1 |      0 |          7
 2017-01-08 |       0 |      1 |          6
 2017-01-09 |       0 |      0 |          6
 2017-01-10 |       0 |      0 |          6
(10 rows)

I solved this in a very "dirty" way:

INSERT INTO logs (status, inserted_at) VALUES
  ('created', '2017-01-01'),
  ('created', '2017-01-01'),
  ('closed', '2017-01-02'),
  ('created', '2017-01-02'),
  ('created', '2017-01-02'),
  ('created', '2017-01-03'),
  ('closed', '2017-01-03'),
  ('created', '2017-01-04'),
  ('created', '2017-01-05'),
  ('created', '2017-01-06'),
  ('created', '2017-01-07'),
  ('closed', '2017-01-08');

  SELECT days.day,
         count(case when logs.inserted_at  = days.day AND logs.status = 'created' then 1 end) as created,
         count(case when logs.inserted_at  = days.day AND logs.status = 'closed' then 1 end) as closed,
         count(case when logs.inserted_at <= days.day AND logs.status = 'created' then 1 end) -
         count(case when logs.inserted_at <= days.day AND logs.status = 'closed' then 1 end) as total
    FROM (SELECT day::date FROM generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) day) days,
         logs
GROUP BY days.day
ORDER BY days.day;

Also (posted it on gist for brevity), and would like to improve the solution.

Right now explain for my query reveals some ridiculous cost numbers that that I would like to minimize (I don't have indexes yet).

How an efficient query to achieve the report above would look like?


Solution

  • A possible solution is to use window functions:

    select s.*, sum(created - closed) over (order by inserted_at)
    from   (select    inserted_at,
                      count(status) filter (where status = 'created') created,
                      count(status) filter (where status = 'closed')  closed
            from      (select d::date inserted_at
                       from   generate_series('2017-01-01'::date, '2017-01-10'::date, '1 day'::interval) d) d
            left join logs using (inserted_at)
            group by  inserted_at) s
    

    http://rextester.com/GFRRP71067

    Also, an index on (inserted_at, status) could help you a lot with this query.

    Note: count(...) filter (where ...) is really just a fancy way to write count(case when ... then ... [else null] end).