Search code examples
sqlgoogle-bigquerychrome-ux-report

SQL nested data running total until value reached


I am trying to use a complex nested dataset on Google BigQuery using the tutorial here: https://developers.google.com/web/tools/chrome-user-experience-report/getting-started

The data exported as json looks like this: https://gist.github.com/kmturley/c46eb3898d6ee62871f4402a4d2c6f7d

The example i'm using which works is: https://bigquery.cloud.google.com/savedquery/226352634162:c5b7ee9ea0394728a821cf60f58737c2

#standardSQL
SELECT
  effective_connection_type.name AS ect,
  SUM((
    SELECT
      SUM(bin.density)
    FROM
      UNNEST(first_contentful_paint.histogram.bin) bin
    WHERE
      bin.end <= 1000
      AND origin = 'http://example.com')) AS density
FROM
  `chrome-ux-report.chrome_ux_report.201710`
GROUP BY
  ect
ORDER BY
  density DESC

This outputs the data:

1   4G  0.6977   
2   3G  0.056    
3   slow-2G null     
4   2G  null     
5   offline null

However I would like to output the bin.end value (time), when SUM(bin.density) reaches 0.7. My expected output would look something like this:

1   4G  1000     
2   3G  50000    
3   slow-2G null     
4   2G  null     
5   offline null

Meaning on 4G connection, 70% (0.7) of page loads are less than 1.5 seconds (1500). I have tried to modify the script to be:

SELECT
  SUM(bin.density)
WHERE
  SUM(bin.density) <= 0.7

But this is not allowed, so tried:

SELECT
  SUM(bin.density) AS RunningTotal
WHERE
  RunningTotal <= 0.7

And also tried

SELECT
  SUM(bin.density) OVER() AS RunningTotal
WHERE
  RunningTotal <= 0.7

But this doesn't work either! How can I implement a running total with nested datasets? and have it output the bin.end time?

If I can't get nested datasets to work with SQL running totals, then my only other option is to flatten the dataset, and loop through each row with Python to calculate the result. Which is far less performant!

Updated: Solution based on answer by Felipe Hoffa

#standardSQL
SELECT origin, form, ect, `end`, density
FROM (
  SELECT origin, form, ect, `end`, density, ROW_NUMBER() OVER(PARTITION BY ect ORDER BY `end` DESC) rn
  FROM (
    SELECT origin, form, ect, bin.end, SUM(bin.density) OVER(PARTITION BY ect ORDER BY `end`) AS density 
    FROM (
      SELECT origin, form_factor.name form, effective_connection_type.name ect, first_contentful_paint.histogram
      FROM    `chrome-ux-report.chrome_ux_report.201710`
      WHERE    origin = 'http://example.com' AND form_factor.name = 'phone'
    ) , UNNEST(histogram.bin) AS bin
  )
  WHERE density < 0.7
) 
WHERE rn=1

https://bigquery.cloud.google.com/savedquery/88263730615:ba1906e86b074511a804660ec973de37


Solution

  • With a cumulative SUM() and sorting the results by connection:

    #standardSQL
    SELECT ect, `end`, density
    FROM (
      SELECT ect, `end`, density, ROW_NUMBER() OVER(PARTITION BY ect ORDER BY `end` DESC) rn
      FROM (
        SELECT  ect, bin.end, SUM(bin.density) OVER(PARTITION BY ect ORDER BY `end`) AS density 
        FROM (
          SELECT  effective_connection_type.name ect, first_contentful_paint.histogram
          FROM    `chrome-ux-report.chrome_ux_report.201710`
          WHERE    origin = 'http://example.com'
        ) , UNNEST(histogram.bin) AS bin
      )
      WHERE density < 0.7
    ) 
    WHERE rn=1
    

    enter image description here