I'm trying to random sample 200 rows from a table, but first I want to filter it to pick only top 1 percent values from a variable.
I'm getting the following error -
Error while compiling statement: FAILED: ParseException line 3:31 cannot recognize input near 'select' 'percentile_approx' '(' in expression specification
Below is my query-
> with sample_pop as (select * from
> mytable a where
> a.transaction_amount > (select
> percentile_approx(transaction_amount, 0.99) as top1
> from mytable) )
>
> select * from sample_pop distribute by rand(1) sort by rand(1) limit
> 200;
I don't think Hive supports scalar subqueries the way you are using them (only for IN
/EXISTS
). So move the logic to the FROM
clause:
with sample_pop as (
select *
from mytable a cross join
(select percentile_approx(transaction_amount, 0.99) as top1
from mytable
) aa
where a.transaction_amount > aa.top1
)
select *
from sample_pop distribute by rand(1)
order by rand(1)
limit 200;