Tags: sql, postgresql, common-table-expression, recursive-query, sql-limit

How to guarantee that at least N rows are returned by recursive CTE in Postgres


Most resources that describe how to do a SELECT TOP ... query in Postgres say to use LIMIT instead, possibly with an ORDER BY clause if you need to select the top elements by some ordering.

What do you do if you need the top N elements of a recursive query where there is no ordering, and the query can return fewer than N rows before recursing? In that case the TOP part is necessary to ensure the result set contains at least N rows, whereas LIMIT can allow fewer.

My specific use case is a modification of the dynamic SQL pattern for selecting a random subsample of a table.

Here is a link to the SQL source of my modification. The easiest thing is to look at the final function defined there, _random_select. It follows the above link closely, but has been modified to be polymorphic in the input table and output result set, and to return only the columns that already existed in the input table (yet another dynamic SQL hack, to exclude the interim row_number column from the final result set).

It's an eyesore, but it's the closest thing I have to a reproducible example. If you use _random_select and attempt to get something around 4500 rows from a table larger than 4500 rows, you begin to see smaller result sets with high probability, and it only gets worse as you increase the size of your desired sample (because the occurrence of duplicates gets worse as your desired sample gets larger).

Note that in my modification, I am not using the _gaps trick from this link, meant to over-sample to offset sampling inefficiency if there are gaps in a certain index column. That part doesn't relate to this question, and in my case I am using row_number to ensure that there is an integer column with no possible gaps.

The CTE is recursive so that if the first, non-recursive term doesn't yield enough rows (because the UNION removes duplicates), the recursive term runs again, tacking on more results until there are enough.
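For reference, the recursive pattern under discussion can be sketched like this (a sketch only, with illustrative names and numbers: big is the sampled table, id its gap-free integer column, 5100000 the known maximum of id, and 1000 the desired sample size; in my modification, row_number plays the role of id):

```sql
WITH RECURSIVE random_pick AS (
   SELECT *                                    -- non-recursive term: first batch
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   generate_series(1, 1100)          -- oversample: limit + buffer
      ) r
   JOIN   big USING (id)                       -- ids with no matching row drop out

   UNION                                       -- removes duplicate rows
   SELECT *                                    -- recursive term: top-up batches
   FROM  (
      SELECT 1 + trunc(random() * 5100000)::int AS id
      FROM   random_pick                       -- self-reference makes it recursive
      LIMIT  100                               -- small batch per iteration
      ) r
   JOIN   big USING (id)
   )
SELECT *
FROM   random_pick
LIMIT  1000;   -- outer LIMIT stops fetching (and thus recursing) at 1000 rows
```

Note the failure mode this question is about: if one iteration produces only duplicates, the UNION folds them all away, the working table for the next iteration is empty, and recursion terminates early, so the outer LIMIT is still only an at-most-N guarantee.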

In the linked example, LIMIT is used, but I am finding that this does not work. The method returns fewer results because LIMIT is only an at-most-N-rows guarantee.

How do you get an at least N rows guarantee? Selecting the TOP N rows seemed like the natural way to do this (so that the recursive CTE had to keep chugging along until it gets enough rows to satisfy the TOP condition), but this is not available in Postgres.


Solution

  • Your assessment is to the point. The recursive query in my referenced answer is only somewhat more flexible than the original simple query. It still requires relatively few gaps in the ID space and a sample size that is substantially smaller than the table size to be reliable.

    While the simple query needs a comfortable surplus ("limit + buffer") to cover the worst-case scenario of misses and duplicates, the recursive query can work with a smaller surplus that's typically enough, since it has a safety net: the recursive term fills in if the first pass falls short of the limit.

    Either way, the technique is intended to get a small random selection from a big table quickly.

    The technique is pointless for cases with too many gaps or (your focus) a sample size that is too close to the total table size - so that the recursive term might run dry before the limit is reached. For such cases a plain old:

    SELECT *   -- or SELECT DISTINCT * to fold duplicates like UNION does
    FROM   tbl -- your table; unquoted TABLE is a reserved word
    ORDER  BY random()
    LIMIT  n;  -- n = desired sample size
    

    ... is more efficient: you are going to read most of the table anyway.
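For comparison, the simple "limit + buffer" query mentioned above can be sketched as follows (a sketch with illustrative names and numbers: big is the table, id its gap-free integer column with known maximum 5100000, and 1000 the desired sample size):

```sql
SELECT *
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 5100000)::int AS id
   FROM   generate_series(1, 1100)   -- 1000 wanted + 100 buffer
   ) r
JOIN   big USING (id)                -- ids with no matching row drop out
LIMIT  1000;   -- still only "at most 1000": misses and duplicates can eat the buffer
```

This is the single-pass variant: if misses and duplicates consume more than the buffer, it simply comes up short, which is why the recursive version adds its top-up term.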