Search code examples
postgresqlhistogramdate-rangeestimation

Estimating Number of rows using Explain in PostgreSQL for Daterange


Does any body knows how the PostgreSQL estimates the number of rows using Explain for date range data types? for instance consider we have this query:

Select * From 'Table1' Where 'period' && '[2015-01-01,2015-12-30)';

In this situation (when we have date range data type) for the period field in table Table1, the fields like Histogram bound in pg_stats are null , but in pg_statistic two value will be save (stakind1=7 and stakind2=6) which there is no documents for this codings, also two stavalues will be saved in the table which it seems one of them is the histogram of the field period but there other one? here is an example :

Stavalue1:"{""[2015-01-02,2015-01-03)"",""[2015-01-29,2015-02-01)"",""[2015-02-09,2015-02-13)""}"
Satvalue2: "{1,1,1}"

here I have three questions:

  1. What is Satvalue2?or what is stakind2=6?
  2. How can we interpret a histogram which upper bound and Lower bound are period?
  3. How 'Explain' can estimate the number of rows for the query I mentioned above?

Thanks in advance


Solution

  • See this definition in src/include/catalog/pg_statistic.h:

    /*
     * A "length histogram" slot describes the distribution of range lengths in
     * rows of a range-type column. stanumbers contains a single entry, the
     * fraction of empty ranges. stavalues is a histogram of non-empty lengths, in
     * a format similar to STATISTIC_KIND_HISTOGRAM: it contains M (>=2) range
     * values that divide the column data values into M-1 bins of approximately
     * equal population. The lengths are stored as float8s, as measured by the
     * range type's subdiff function. Only non-null rows are considered.
     */
    #define STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM  6
    
    /*
     * A "bounds histogram" slot is similar to STATISTIC_KIND_HISTOGRAM, but for
     * a range-type column.  stavalues contains M (>=2) range values that divide
     * the column data values into M-1 bins of approximately equal population.
     * Unlike a regular scalar histogram, this is actually two histograms combined
     * into a single array, with the lower bounds of each value forming a
     * histogram of lower bounds, and the upper bounds a histogram of upper
     * bounds.  Only non-NULL, non-empty ranges are included.
     */
    #define STATISTIC_KIND_BOUNDS_HISTOGRAM  7
    

    This answers the first and the second question.

    The selectivity of && is calculated in calc_hist_selectivity in src/backend/utils/adt/rangetypes_selfuncs.c. The bounds histogram is split in two histograms: hist_upper for the upper values and hist_lower for the lower values. This code explains what happens:

            case OID_RANGE_OVERLAP_OP:
            case OID_RANGE_CONTAINS_ELEM_OP:
    
                /*
                 * A && B <=> NOT (A << B OR A >> B).
                 *
                 * Since A << B and A >> B are mutually exclusive events we can
                 * sum their probabilities to find probability of (A << B OR A >>
                 * B).
                 *
                 * "range @> elem" is equivalent to "range && [elem,elem]". The
                 * caller already constructed the singular range from the element
                 * constant, so just treat it the same as &&.
                 */
                hist_selec =
                    calc_hist_selectivity_scalar(typcache, &const_lower, hist_upper,
                                                 nhist, false);
                hist_selec +=
                    (1.0 - calc_hist_selectivity_scalar(typcache, &const_upper, hist_lower,
                                                        nhist, true));
                hist_selec = 1.0 - hist_selec;
                break;