Search code examples
postgresqlhistogramaggregate-functionscommon-table-expressionwindow-functions

Postgres Histogram Adding Records


Creating histogram using Postgres 15, getting strange results.

My column has 25 records, depending on the nbins value in the bin_params cte I get different record counts when I sum the 'freq' column. When nbins = 12 or 13 I get too many (26). When nbins = 11 I get the right count of 25.

Another approached I've tried, but led to similarly confusing results, is to multiply the min(x) in the bin_params cte by .99, .999, and .9999999. When I vary the # of decimal places I get differing record counts returned, same behavior as described above.

My desired solution is sum(freq) = # or records in the column.

I've read/tried solutions from here, here and here, along with many others. The code below is as close as I've come to a workable solution.

Thank you for helping me.

Here's my code (sample data at the bottom):

with rnorm as
                (SELECT 
                        my_col::numeric as x
                        from my_table
                ),
bin_params as
                (select 
                        min(x) as min_x 
                        ,max(x) as max_x
                        ,13 as nbins
                from rnorm),
temp_bins as
                (SELECT
                        generate_series(min_x::numeric, max_x::numeric, ((max_x - min_x) / nbins)::numeric) as bin
                from bin_params
                ),
bin_range as
                ( select
                        lag(bin) over (order by bin) as low_bin
                        ,b.bin as high_bin

                    from temp_bins b
                ),
frequency_table as
        (   select
                        b.low_bin
                        ,b.high_bin
                        ,count(*) as freq
                from bin_range b
                left join rnorm r
                        on r.x <= b.high_bin
                        and r.x > b.low_bin
                where   b.low_bin is not null
                        and b.high_bin is not null
                group by
                        b.low_bin
                        ,b.high_bin
                        ,count_x
                order by b.low_bin
        )

select * from frequency_table

my_table:

|"my_col"|
|------|
|74.03|
|73.995|
|73.988|
|74.002|
|73.992|
|74.009|
|73.995|
|73.985|
|74.008|
|73.998|
|73.994|
|74.004|
|73.983|
|74.006|
|74.012|
|74|
|73.994|
|74.006|
|73.984|
|74|
|73.988|
|74.004|
|74.01|
|74.015|
|73.982|

Solution

  • There are three issues.

    First, generate_series() results may not contain the stop value, it depends on the step. See the third example in Set Returning Functions. You can add step to the second parameter to be sure that all values are in the generated ranges.

    Second, the strict inequality r.x > b.low_bin omits min_x value. Use a bit smaller start parameter for generated series.

    Third, count(*) as freq gives 1 even when there are no values in the range (because of left join). Should be count(x) as freq.