Search code examples
sqlpostgresqlcountrowcount

Fast way to discover the row count of a table in PostgreSQL


I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.

I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.

Is it possible to stop counting as soon as my constant value is surpassed?

I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.

Something like this:

SELECT text,count(*), percentual_calculus()  
FROM token  
GROUP BY text  
ORDER BY count DESC;

Solution

  • Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.

    (Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)

    Exact count

    Slow for big tables.
    With concurrent write operations, it may be outdated the moment you get it.

    SELECT count(*) AS exact_count FROM myschema.mytable;
    
    Estimate

    Extremely fast:

    SELECT reltuples::bigint AS estimate FROM pg_class where relname = 'mytable';
    

    Typically, the estimate is very close. How close, depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.

    Safer estimate

    The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:

    SELECT c.reltuples::bigint AS estimate
    FROM   pg_class c
    JOIN   pg_namespace n ON n.oid = c.relnamespace
    WHERE  c.relname = 'mytable'
    AND    n.nspname = 'myschema';
    

    The cast to bigint formats the real number nicely, especially for big counts.

    Better estimate

    SELECT reltuples::bigint AS estimate
    FROM   pg_class
    WHERE  oid = 'myschema.mytable'::regclass;
    

    Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.

    Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:

    Better estimate yet (for very little added cost)

    This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
    You have to add up estimates for all partitions instead.

    We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:

    These numbers are current as of the last VACUUM or ANALYZE on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different from relpages then reltuples is scaled accordingly to arrive at a current number-of-rows estimate.

    Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:

    Minimal form

    SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
    FROM   pg_class
    WHERE  oid = 'mytable'::regclass;  -- your table here
    

    Safe and explicit

    SELECT (CASE WHEN c.reltuples < 0 THEN NULL       -- never vacuumed
                 WHEN c.relpages = 0 THEN float8 '0'  -- empty table
                 ELSE c.reltuples / c.relpages END
         * (pg_catalog.pg_relation_size(c.oid)
          / pg_catalog.current_setting('block_size')::int)
           )::bigint
    FROM   pg_catalog.pg_class c
    WHERE  c.oid = 'myschema.mytable'::regclass;      -- schema-qualified table here
    

    Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:

    If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.

    If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)

    If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)

    Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.

    Table and schema qualifications make it immune to any search_path and scope.

    Either way, the query consistently takes < 0.1 ms for me.

    More Web resources:


    TABLESAMPLE SYSTEM (n) in Postgres 9.5+

    SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
    

    Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:

    • No autovacuum running.
    • Immediately after a large INSERT / UPDATE / DELETE.
    • TEMPORARY tables (which are not covered by autovacuum).

    This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:

    • Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
    • Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
    • General rounding errors.

    Typically, the estimate from pg_class will be faster and more accurate.

    Answer to actual question

    First, I need to know the number of rows in that table, if the total count is greater than some predefined constant,

    And whether it ...

    ... is possible at the moment the count pass my constant value, it will stop the counting (and not wait to finish the counting to inform the row count is greater).

    Yes. You can use a subquery with LIMIT:

    SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
    

    Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.