Search code examples
sqlhivehqlhiveqlhue

How to filter table based on percentile and then random sample in HQL?


I'm trying to random sample 200 rows from a table, but first I want to filter it to pick only top 1 percent values from a variable.

I'm getting the following error -

Error while compiling statement: FAILED: ParseException line 3:31 cannot recognize input near 'select' 'percentile_approx' '(' in expression specification

Below is my query-

> with sample_pop as (select * from
> mytable a where
> a.transaction_amount > (select
> percentile_approx(transaction_amount, 0.99) as top1
>                             from mytable) )
> 
> select * from sample_pop  distribute by rand(1) sort by rand(1) limit
> 200;

Solution

  • I don't think Hive supports scalar subqueries the way you are using them (only for IN/EXISTS). So move the logic to the FROM clause:

    with sample_pop as (
          select *
          from mytable a cross join
               (select percentile_approx(transaction_amount, 0.99) as top1
                from mytable
               ) aa
          where a.transaction_amount > aa.top1
         )
    select * 
    from sample_pop distribute by rand(1) 
    order by rand(1)
    limit 200;