Search code examples
google-cloud-platformgoogle-bigquery

PERCENT_RANK() in BigQuery returns Resources exceeded


When I try to use PERCENT_RANK() over a large dataset, it gives me an error.

SELECT 
  a2_lngram, 
  a2_decade, 
  a2_totalfreq, 
  a2_totalbooks, 
  a2_freq, a2_bfreq, 
  a2_arf, 
  c_avgarf, 
  d_arf, 
  oi, 
  PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf 
FROM [trigram.trigrams8]

With a destination table and AllowLargeResults returns:

"Resources exceeded during query execution."

When I limit the results to few hundreds it runs fine.

JobID: otichyproject1:job_PpTpmMXYETUMiM_2scGgc997JVg The dataset is public.


Solution

  • This is expected: The input for an analytic/window function needs to fit in one node for it to run successfully.

    PERCENT_RANK() OVER (ORDER BY d_arf DESC) plarf
    

    will only run if all the rows fit in one node. If they don't you'll see the "Resources exceeded during query execution" error.

    There's a way to scale up with analytic functions: Partition your data.

    PERCENT_RANK() OVER (PARTITION BY country ORDER BY d_arf DESC) plarf
    

    ... then the function can be run over multiple nodes, as long as each 'country' rows fit in one VM.

    Not your case though - the fix I would do here is calculate the total on a separate subquery, join, and divide.

    In summary, analytic functions are cool, but they have scalability issues on the size of each partition - luckily there are other ways to get the same results.