Search code examples
statisticsgoogle-bigqueryquantilepercentilequartile

Google BigQuery APPROX_QUANTILES and getting true quartiles


According to the docs:

Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.

It sounds like if I want true quartiles, I need to use APPROX_QUANTILES(values, 4) which will return [minvalue, 1st quartile, 2nd quartile, 3rd quartile, maxvalue]

As according to https://en.wikipedia.org/wiki/Quartile, quartile sets contain 3 data points - none of which is the min/max values of the data.

Is my assumption correct? Is APPROX_QUANTILES(values, 4) going to return the true quartiles?


Solution

  • As a baseline, this is the output without any modification, using an input of numbers between 1 and 100:

    SELECT APPROX_QUANTILES(x, 4) AS output
    FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
    +----------------------------+
    |           output           |
    +----------------------------+
    | ["1","25","50","75","100"] |
    +----------------------------+
    

    The output includes both the minimum (1) and the maximum (100). If you just want the quartiles, you need to strip them from the array. For the sake of readability/composability, it's best to do this using a temporary SQL UDF. Here I'm using INT64 for the element type, but you could have a different element type, alternatively:

    CREATE TEMP FUNCTION StripFirstLast(arr ARRAY<INT64>) AS (
      ARRAY(SELECT x FROM UNNEST(arr) AS x WITH OFFSET
      WHERE OFFSET BETWEEN 1 AND ARRAY_LENGTH(arr) - 2)
    );
    
    SELECT
      APPROX_QUANTILES(x, 4) AS output,
      StripFirstLast(APPROX_QUANTILES(x, 4)) AS quartiles
    FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
    +----------------------------+------------------+
    |           output           |    quartiles     |
    +----------------------------+------------------+
    | ["1","25","50","75","100"] | ["25","50","75"] |
    +----------------------------+------------------+
    

    You can see that the quartiles array contains only the desired values.