Creating histogram using Postgres 15, getting strange results.
My column has 25 records, depending on the nbins
value in the bin_params
cte I get different record counts when I sum the 'freq'
column. When nbins
= 12 or 13 I get too many (26). When nbins
= 11 I get the right count of 25.
Another approached I've tried, but led to similarly confusing results, is to multiply the min(x)
in the bin_params
cte by .99, .999, and .9999999. When I vary the # of decimal places I get differing record counts returned, same behavior as described above.
My desired solution is sum(freq)
= # or records in the column.
I've read/tried solutions from here, here and here, along with many others. The code below is as close as I've come to a workable solution.
Thank you for helping me.
Here's my code (sample data at the bottom):
with rnorm as
(SELECT
my_col::numeric as x
from my_table
),
bin_params as
(select
min(x) as min_x
,max(x) as max_x
,13 as nbins
from rnorm),
temp_bins as
(SELECT
generate_series(min_x::numeric, max_x::numeric, ((max_x - min_x) / nbins)::numeric) as bin
from bin_params
),
bin_range as
( select
lag(bin) over (order by bin) as low_bin
,b.bin as high_bin
from temp_bins b
),
frequency_table as
( select
b.low_bin
,b.high_bin
,count(*) as freq
from bin_range b
left join rnorm r
on r.x <= b.high_bin
and r.x > b.low_bin
where b.low_bin is not null
and b.high_bin is not null
group by
b.low_bin
,b.high_bin
,count_x
order by b.low_bin
)
select * from frequency_table
my_table:
|"my_col"|
|------|
|74.03|
|73.995|
|73.988|
|74.002|
|73.992|
|74.009|
|73.995|
|73.985|
|74.008|
|73.998|
|73.994|
|74.004|
|73.983|
|74.006|
|74.012|
|74|
|73.994|
|74.006|
|73.984|
|74|
|73.988|
|74.004|
|74.01|
|74.015|
|73.982|
There are three issues.
First, generate_series()
results may not contain the stop
value, it depends on the step. See the third example in Set Returning Functions. You can add step
to the second parameter to be sure that all values are in the generated ranges.
Second, the strict inequality r.x > b.low_bin
omits min_x
value. Use a bit smaller start
parameter for generated series.
Third, count(*) as freq
gives 1 even when there are no values in the range (because of left join). Should be count(x) as freq
.