Tags: sql, database, rdbms

Count number of rows returned in a SQL statement


Are there any DB engines that let you run an EXPLAIN (or another function) that gives an approximate count of the rows a query would return before the final aggregation is applied (not rows scanned, but rows that would actually be returned)? For example, in the following query:

SELECT gender, COUNT(1) FROM sales JOIN (
       SELECT id, person FROM sales2 WHERE country='US'
       GROUP BY id, person
) USING (id)
WHERE sales.age > 20
GROUP BY gender

Let's say this query returns 3 rows after being aggregated, but would return 170M rows if unaggregated.

Are there any tools that can give you this '170M' number without running the full query, or is this fundamentally hard (complexity theory or something similar), in that getting the count is almost as expensive as running the query itself, minus the final aggregation/HAVING/sort/LIMIT/etc.? In other words, doing a rewrite to:

SELECT COUNT(1) FROM sales JOIN (
       SELECT id, person FROM sales2 WHERE country='US'
       GROUP BY id, person
) USING (id)
WHERE sales.age > 20

But having to execute the query nonetheless.

As an example of how 'off' MySQL's current EXPLAIN is for what I'm looking for:

explain select * from movies where title>'a';
# rows=147900
select count(1) from movies where title>'a';
# 144647 --> OK, pretty close


explain select * from movies where title>'u';
# rows=147900
select count(1) from movies where title>'u';
# 11816 --> Not close at all
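
For completeness, the same estimate can also be read in machine-readable form from the JSON plan (MySQL 5.7+), though it is the same potentially-way-off number:

explain format=json select * from movies where title>'u';
# "rows_examined_per_scan" / "rows_produced_per_join" in the JSON output
# carry the same row estimates shown above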

Solution

  • Assuming you can use MS SQL Server, you could tap into the same data the Optimiser is using for cardinality estimation: DBCC SHOW_STATISTICS (table, index) WITH HISTOGRAM

    Part of the data you get back is a per-column histogram, which is essentially the number of rows for each value range found in the table.
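
    For example, dumping the histogram for the statistic used below (the CreationDate statistic on the Stack Overflow Users table):

    -- Each row of the output is one histogram step: RANGE_HI_KEY is the step's
    -- upper bound, RANGE_ROWS/EQ_ROWS the row counts below/at that bound
    DBCC SHOW_STATISTICS (Users, CreationDate) WITH HISTOGRAM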

    You probably want to query the data programmatically; one way to achieve this is to insert it into a temp table:

    -- RANGE_HI_KEY must match the type of the statistic's leading column
    -- (datetime here, since the statistic is on Users.CreationDate)
    CREATE TABLE #histogram (
        RANGE_HI_KEY datetime PRIMARY KEY,
        RANGE_ROWS INT,
        EQ_ROWS INT,
        DISTINCT_RANGE_ROWS INT,
        AVG_RANGE_ROWS FLOAT
    )

    -- Capture the histogram rows produced by DBCC SHOW_STATISTICS
    INSERT INTO #histogram
    EXEC ('DBCC SHOW_STATISTICS (Users, CreationDate) WITH HISTOGRAM')

    -- Estimate = sum of rows in every histogram step inside the predicate range
    SELECT 'Estimate', SUM(RANGE_ROWS + EQ_ROWS) FROM #histogram WHERE RANGE_HI_KEY BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
    UNION ALL
    SELECT 'Actual', COUNT(1) FROM Users u WHERE u.CreationDate BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'
    

    For example, check out the results of this same query run against the Stack Overflow database:

    |          | Rows  |
    | -------- | ----- |
    | Estimate | 98092 |
    | Actual   | 11715 |
    

    It seems like a lot, but keep in mind that the whole table has almost 15 million records.
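
    As an aside, on newer versions (SQL Server 2016 SP1 CU2+, if memory serves) you can skip the temp table and read the same histogram through the sys.dm_db_stats_histogram DMF. A minimal sketch, assuming the same Users/CreationDate statistic as above:

    SELECT SUM(h.range_rows + h.equal_rows) AS Estimate
    FROM sys.stats AS s
    CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
    WHERE s.object_id = OBJECT_ID('Users')
      AND s.name = 'CreationDate'
      -- range_high_key is sql_variant, so cast before comparing
      AND CAST(h.range_high_key AS datetime)
          BETWEEN '2010-08-30 08:28:45.070' AND '2010-09-20 22:15:33.603'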

    A note on precision and other gotchas

    • The maximum number of histogram steps is capped at 200 - which is not a lot, so you are not getting a guaranteed 10% margin of error, but then neither does SQL Server itself.
    • As you insert data into the table, histograms may get stale, so your results would get skewed even more.
    • There are different ways to refresh this data; some are reasonably quick, while others effectively require a full table scan (see the sketch after this list).
    • Not all columns will have statistics. You can either create them manually or (I believe) they get created automatically when you run a query with the column as a predicate.
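
    As a sketch of the refresh/create options from the last two bullets (names follow the Users example above; the Age column and the statistic name st_Users_Age are just for illustration):

    -- Sampled refresh: reasonably quick, scans only a fraction of the table
    UPDATE STATISTICS Users CreationDate;

    -- Full scan: the most accurate histogram, but reads the entire table
    UPDATE STATISTICS Users CreationDate WITH FULLSCAN;

    -- Manually create statistics on a column that has none yet
    CREATE STATISTICS st_Users_Age ON Users (Age);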