I am attempting to sample 20% of a table in impala. I have heard somewhere that the built in impala sampling function has issues.
Is there a way to pass in a subquery to the impala limit function to sample n percent of the entire table.
I have something like this:
select
* from
table_a
order by rand()
limit
(
select
round( (count(distinct ids)) *.2,0)
from table_a)
)
The sub query gives me 20% of all records
I'm not sure if Impala has specific sampling logic (some databases do). But you can use window functions:
select a.*
from (select a.*,
row_number() over (order by rand()) as seqnum,
count(*) over () as cnt
from table_a
) a
where seqnum <= cnt * 0.2;