Search code examples
sqlhadoopimpala

How to Pass Query Answer into Limit Function Impala


I am attempting to sample 20% of a table in impala. I have heard somewhere that the built in impala sampling function has issues.

Is there a way to pass in a subquery to the impala limit function to sample n percent of the entire table.

I have something like this:

select 
* from
table_a
order by rand()
limit
(
select 
round( (count(distinct ids)) *.2,0)
from table_a) 
)

The sub query gives me 20% of all records


Solution

  • I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions:

    select a.*
    from (select a.*,
                 row_number() over (order by rand()) as seqnum,
                 count(*) over () as cnt
          from table_a
         ) a
    where seqnum <= cnt * 0.2;