When I try to use PERCENT_RANK() over a large dataset, it gives me an error.
SELECT
a2_lngram,
a2_decade,
a2_totalfreq,
a2_totalbooks,
a2_freq, a2_bfreq,
a2_arf,
c_avgarf,
d_arf,
oi,
PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
FROM [trigram.trigrams8]
With a destination table and AllowLargeResults returns:
"Resources exceeded during query execution."
When I limit the results to few hundreds it runs fine.
JobID: otichyproject1:job_PpTpmMXYETUMiM_2scGgc997JVg The dataset is public.
This is expected: The input for an analytic/window function needs to fit in one node for it to run successfully.
PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
will only run if all the rows fit in one node. If they don't you'll see the "Resources exceeded during query execution" error.
There's a way to scale up with analytic functions: Partition your data.
PERCENT_RANK() OVER (PARTITION BY country ORDER BY d_arf DESC) plarf
... then the function can be run over multiple nodes, as long as each 'country' rows fit in one VM.
Not your case though - the fix I would do here is calculate the total on a separate subquery, join, and divide.
In summary, analytic functions are cool, but they have scalability issues on the size of each partition - luckily there are other ways to get the same results.