
Generate date-histogram over table


I wrote a query that renders a bar graph in the terminal using the Postgres CLI (psql). The query is slow and inefficient, and I would like to change that.

At the base, we have a pretty simple query: we want each row of the chart to represent an equal division of the total number of rows in our table. Let's say that our hardcoded number of rows is N_ROWS, and our table is my_table.

Also, let's say N_ROWS equals 8.

select
    (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level 
from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)

In my case, this returns the Y-Axis of my chart as:

 level
-------
 71760
 62790
 53820
 44850
 35880
 26910
 17940
  8970
     0

You can see the issues with that query already.

Can I programmatically generate the rows from N_ROWS rather than hardcoding each value in VALUES? I also don't like how I perform a new count over my whole table for each row, obviously.
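
For reference, the integer list itself could come from a set-returning function such as generate_series rather than a hardcoded VALUES list; a minimal sketch, with 8 still written out in place of N_ROWS:

select num from generate_series(0, 8) as t (num);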

We now need our X-Axis, and this is what I came up with:

select
    r.level,
    case
        when (
            select count(id) from my_table where created_at_utc <= '2019-01-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-01-01"
from (
    select (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
    from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
) as r;

Which returns our first bucket:

 level | 2019-01-01
-------+------------
 71760 |
 62790 |
 53820 |
 44850 |
 35880 |
 26910 | t
 17940 | t
  8970 | t
     0 | t

I'd rather not hardcode a case statement for each bucket, but of course that's what I did. The results are what I was looking for:

 level | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01
-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
 71760 |            |            |            |            |            |            |            |            |            |            |            | t
 62790 |            |            |            |            | t          | t          | t          | t          | t          | t          | t          | t
 53820 |            |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t
 44850 |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 35880 |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 26910 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
 17940 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
  8970 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
     0 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t

Solution

  • There are certainly a few improvements we can make.

    First, let's make a test table with some data:

    CREATE TABLE test (id bigint, dt date);
    -- Add 100,000 rows
    insert into test select generate_series(1,100000, 1);
    -- Add dates from 2019-01-01 to 2019-01-11
    update test set dt='2019-01-01'::date + (id/10000)::int;
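
    As a quick sanity check of the generated data (optional; the expected figures in the comment simply follow from the inserts above):

    select count(*), min(dt), max(dt) from test;
    -- should show 100000 rows spanning 2019-01-01 through 2019-01-11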
    

    We can very nearly replace your first query for finding the levels with this much faster one:

       SELECT unnest(percentile_disc(
                      (
                          SELECT array_agg(x) 
                          FROM generate_series(0, 1, (1::numeric)/8) as g(x))
                      ) WITHIN GROUP (ORDER BY id)
                     ) as l
        FROM test;
       l
    --------
          1
      12500
      25000
      37500
      50000
      62500
      75000
      87500
     100000
    (9 rows)
    

    Note that the first level is 1 instead of 0, but the rest should be the same.
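
    If the ordered-set aggregate is new to you: percentile_disc picks the input value sitting at (or just above) a given fraction of the sorted input, and the array form takes several fractions at once and returns an array. A tiny self-contained sketch with made-up values:

    SELECT percentile_disc(array[0, 0.5, 1]) WITHIN GROUP (ORDER BY v) AS picks
    FROM (VALUES (10), (20), (30), (40)) AS t (v);
    -- picks should be {10,20,40}: the minimum, the discrete median, and the maximum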

    There are a few other tricks that we can employ:

    • We'll get our list of dates from generate_series
    • We can group the data in test by day (or by date_trunc('day', created_at_utc) if you have timestamp data) and count the ids. A window function over this count then gives us a cumulative sum of ids per day (sketched on its own right after this list).
    • We can use \crosstabview in psql to pivot the resulting query
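
    As a standalone sketch of the cumulative count from the second point (this is just the counts_per_day piece of the full query below, pulled out to run on its own against the test table):

    SELECT dt,
           sum(counts) OVER (ORDER BY dt) AS cum_sum -- running total of rows up to and including each day
    FROM (
      SELECT dt, count(id) AS counts -- rows per day
      FROM test
      GROUP BY dt
    ) sub;

    Putting all of this together and pivoting with \crosstabview:
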
    WITH num_levels AS (
        SELECT 8 as num_levels
    ), levels as (
       SELECT unnest(percentile_disc(
                      (
                          SELECT array_agg(x) 
                          FROM num_levels
                          CROSS JOIN LATERAL generate_series(0, 1, (1::numeric)/num_levels.num_levels) as g(x))
                      ) WITHIN GROUP (ORDER BY id)
                     ) as l
        FROM test
    ), dates as (
      SELECT d
      FROM generate_series('2019-01-01T00:00:00'::timestamp, '2019-01-11T00:00:00'::timestamp, '1 day') as g(d)
    ), counts_per_day AS (
      SELECT dt, 
             sum(counts) OVER (ORDER BY dt) as cum_sum -- the cumulative count
      FROM (
        SELECT dt, 
        count(id) as counts -- The count per day
        FROM test
        GROUP BY dt
      ) sub
    )
    SELECT l, dt, CASE WHEN cum_sum >= l THEN true ELSE null END
    FROM levels, dates
    LEFT JOIN counts_per_day ON dt = d
    ORDER BY l DESC, d asc
    \crosstabview
       l    | 2019-01-01 | 2019-01-02 | 2019-01-03 | 2019-01-04 | 2019-01-05 | 2019-01-06 | 2019-01-07 | 2019-01-08 | 2019-01-09 | 2019-01-10 | 2019-01-11
    --------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
     100000 |            |            |            |            |            |            |            |            |            |            | t
      87500 |            |            |            |            |            |            |            |            | t          | t          | t
      75000 |            |            |            |            |            |            |            | t          | t          | t          | t
      62500 |            |            |            |            |            |            | t          | t          | t          | t          | t
      50000 |            |            |            |            |            | t          | t          | t          | t          | t          | t
      37500 |            |            |            | t          | t          | t          | t          | t          | t          | t          | t
      25000 |            |            | t          | t          | t          | t          | t          | t          | t          | t          | t
      12500 |            | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
          1 | t          | t          | t          | t          | t          | t          | t          | t          | t          | t          | t
    (9 rows)
    

    That query ran in 40ms on my laptop.

    The date series could also be derived from the min and max dates in the test table, and the 1 day interval could be changed depending on how many columns you want between them.
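
    A minimal sketch of that variant (it would replace the hardcoded bounds in the dates CTE above; the 1 day interval is still a choice to tune):

    SELECT g.d
    FROM (SELECT min(dt) AS lo, max(dt) AS hi FROM test) AS bounds
    CROSS JOIN LATERAL generate_series(bounds.lo::timestamp, bounds.hi::timestamp, '1 day') AS g(d);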