I wrote a query that returns a bar graph in the terminal using the Postgres CLI. The query is slow and inefficient, and I would like to change that.
At the base, we have a pretty simple query. We want each row to represent a division of the total number of rows in our table. Let's say that our hardcoded number of rows is N_ROWS and our table is my_table. Also, let's say N_ROWS equals 8.
select
    (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
In my case, this returns my Y-Axis of the chart as:
level
-------
71760
62790
53820
44850
35880
26910
17940
8970
0
You can see the issues with that query already.
Can I programmatically generate a number of rows using N_ROWS and not hardcode each row value in VALUES? I also don't like how I perform a new count over my whole table for each row, obviously.
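(I suspect generate_series can take the place of that VALUES list, something roughly like the sketch below, though I haven't folded it into the full query and it still repeats the count:)
-- hypothetical substitution: generate 0..N_ROWS instead of hardcoding a VALUES list
select
    (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
from generate_series(0, N_ROWS) as t (num)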
We now need our X-Axis, and this is what I came up with:
select
    r.level,
    case
        when (
            select count(id) from my_table
            where created_at_utc <= '2019-01-01 00:00:00'::timestamp without time zone
        ) >= r.level then true
    end as "2019-01-01"
from (
    select (select count(id) from my_table) / N_ROWS * (N_ROWS - num) as level
    from (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8)) as t (num)
) as r;
Which returns our first bucket:
level | 2019-01-01
-------+------------
71760 |
62790 |
53820 |
44850 |
35880 |
26910 | t
17940 | t
8970 | t
0 | t
I'd rather not hardcode a CASE statement for each bucket, but that is, of course, exactly what I did. The results are what I was looking for:
level | 2019-01-01 | 2019-02-01 | 2019-03-01 | 2019-04-01 | 2019-05-01 | 2019-06-01 | 2019-07-01 | 2019-08-01 | 2019-09-01 | 2019-10-01 | 2019-11-01 | 2019-12-01
-------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
71760 | | | | | | | | | | | | t
62790 | | | | | t | t | t | t | t | t | t | t
53820 | | | | t | t | t | t | t | t | t | t | t
44850 | | | t | t | t | t | t | t | t | t | t | t
35880 | | t | t | t | t | t | t | t | t | t | t | t
26910 | t | t | t | t | t | t | t | t | t | t | t | t
17940 | t | t | t | t | t | t | t | t | t | t | t | t
8970 | t | t | t | t | t | t | t | t | t | t | t | t
0 | t | t | t | t | t | t | t | t | t | t | t | t
There are certainly a few improvements we can make.
First, let's make a test table with some data:
CREATE TABLE test (id bigint, dt date);
-- Add 100,000 rows
insert into test select generate_series(1,100000, 1);
-- Add dates from 2019-01-01 to 2019-01-11
update test set dt='2019-01-01'::date + (id/10000)::int;
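As a quick sanity check on the generated data (not part of the original setup), a per-day count should show roughly 10,000 rows per date, with only the single row for id = 100000 landing on 2019-01-11:
-- rows per day in the test table
select dt, count(*) as rows_per_day
from test
group by dt
order by dt;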
We can substitute your first query, the one that finds the levels, with this much faster query, which gives almost the same result:
SELECT unnest(percentile_disc(
           (SELECT array_agg(x)
            FROM generate_series(0, 1, (1::numeric)/8) as g(x))
       ) WITHIN GROUP (ORDER BY id)) as l
FROM test;
l
--------
1
12500
25000
37500
50000
62500
75000
87500
100000
(9 rows)
Note that the first level is 1 instead of 0, but the rest should be the same.
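If you want to keep your original 0-based levels while still dropping the hardcoded VALUES list and the per-row counts, a plain generate_series rewrite works too. This is a minimal sketch against the test table above (not part of the original answer), doing a single count and multiplying it out:
-- single count, levels generated with generate_series; 0 stays the bottom level
SELECT total / 8 * num as level
FROM (SELECT count(*) as total FROM test) c
CROSS JOIN generate_series(0, 8) as t(num)
ORDER BY level DESC;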
There are a few other tricks that we can employ:
WITH num_levels AS (
    SELECT 8 as num_levels
), levels as (
    SELECT unnest(percentile_disc(
               (SELECT array_agg(x)
                FROM num_levels
                CROSS JOIN LATERAL generate_series(0, 1, (1::numeric)/num_levels.num_levels) as g(x))
           ) WITHIN GROUP (ORDER BY id)) as l
    FROM test
), dates as (
    SELECT d
    FROM generate_series('2019-01-01T00:00:00'::timestamp, '2019-01-11T00:00:00'::timestamp, '1 day') as g(d)
), counts_per_day AS (
    SELECT dt,
           sum(counts) OVER (ORDER BY dt) as cum_sum -- the cumulative count
    FROM (
        SELECT dt,
               count(id) as counts -- the count per day
        FROM test
        GROUP BY dt
    ) sub
)
SELECT l, dt, CASE WHEN cum_sum >= l THEN true ELSE null END
FROM levels, dates
LEFT JOIN counts_per_day ON dt = d
ORDER BY l DESC, d asc
\crosstabview
l | 2019-01-01 | 2019-01-02 | 2019-01-03 | 2019-01-04 | 2019-01-05 | 2019-01-06 | 2019-01-07 | 2019-01-08 | 2019-01-09 | 2019-01-10 | 2019-01-11
--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------
100000 | | | | | | | | | | | t
87500 | | | | | | | | | t | t | t
75000 | | | | | | | | t | t | t | t
62500 | | | | | | | t | t | t | t | t
50000 | | | | | | t | t | t | t | t | t
37500 | | | | t | t | t | t | t | t | t | t
25000 | | | t | t | t | t | t | t | t | t | t
12500 | | t | t | t | t | t | t | t | t | t | t
1 | t | t | t | t | t | t | t | t | t | t | t
(9 rows)
That query ran in 40ms on my laptop.
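If the window-function step is the unfamiliar part, the counts_per_day CTE can be run on its own: one pass over the table produces the per-day counts, and the window sum turns them into the running totals that the CASE compares against each level.
-- per-day counts plus their running total, exactly as in counts_per_day
SELECT dt,
       counts,
       sum(counts) OVER (ORDER BY dt) as cum_sum
FROM (
    SELECT dt, count(id) as counts
    FROM test
    GROUP BY dt
) sub
ORDER BY dt;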
The dates could instead be selected from the min and max of the dates in the test table, and the interval could be changed from 1 day depending on how many columns you want between them.
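A minimal sketch of that idea, assuming the same test table (the interval literal is the knob for the number of columns):
-- derive the date axis from the data; widen the interval for fewer columns
SELECT d::date
FROM (SELECT min(dt) as lo, max(dt) as hi FROM test) b
CROSS JOIN LATERAL generate_series(b.lo, b.hi, interval '1 day') as g(d);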