Search code examples
sqlpresto

PRESTO SQL Use Total Counts From one Table to Sample Another Table


I am trying to sample a table in presto using counts from a query. So effectively I am trying to do something like this. I have one table where I want to filter based on criteria, and use the total count of the filtered table to sample from the other table. I do not think the query below works in any shape or form, but it is effectively what I am trying to do. Can someone assist me here?

WITH less_100_table AS (Select * from Table A where my_total < 100),
WITH greater_100_table AS (Select * from Table A where my_total > 100),
total_count AS (Select count(*) from Table A)
counts_less_100 AS select count(*) from less_100_table
select * from greater_100_table TABLESAMPLE BERNOULLI ((counts_less_100/total_count)*100) 

Solution

  • I don't know whether TABLESAMPLE BERNOULLI allows a calculated percentage. I don't even know whether it allows to select a sample from a view. The documentation (https://prestodb.io/docs/current/sql/select.html#tablesample) is a tad short, not even mentioning where in the query the sample clause can be placed.

    Here is your query with the obvious syntax errors corrected. Maybe it will work, maybe not. I don't know. Just try.

    with less_100_table as (select * from a where my_total < 100),
         greater_100_table as (select * from a where my_total > 100)
    select *
    from greater_100_table
    tablesample bernoulli 
    (
      (select count(*) from less_100_table) * 100.0 / 
      (select count(*) from a) total_count)
    );
    

    If this runs and your real query is very much alike, you can also use conditional aggregation:

    select * 
    from a
    where my_total > 100
    tablesample bernoulli 
    (
      (select avg(case when my_total < 100 then 100.0 else 0.0 end) from a)
    );