Search code examples
google-bigquerymedian

PERCENTILE_CONT and group by BigQuery


I want to calculate the means and medians of the column3, separately for different categories of column1 and column2. So basically I would like to calculate the means and medians grouped by column1 and column2.

The data looks something like this:

Table `xx.yy.zz`
column1  column2            column3
A            A1                1
A            A1                2
A            A1                3
B            B2                10
B            B2                15
B            B2                20
...

The desired output would be:

column1   column2             median3    mean3
A            A1                 2        2
A            A2               median    mean
A            A3               median    mean
B            B1               median    mean
B            B2                 15       15
C            C1               median    mean

I tried with the code below. The first part of the code (table1) does the right work while both trials for the second part (table2) don't work. What I am doing wrong? What would be the correct way to calculate the median of the column3 group by column1 and column2?

My code so far:

WITH  

table1 AS (SELECT column1, column2, 
AVG(column3) AS mean3
FROM xx.yy.zz
GROUP BY 1,2
),

table2 AS (SELECT column1, column2,
PERCENTILE_CONT(column3, 0.5) OVER(PARTITION BY column1, column2,)  AS median3
FROM xx.yy.zz
group by 1,2
),

**OR**

table2 AS (SELECT 
PERCENTILE_CONT(column3, 0.5) OVER(PARTITION BY column1, column2,)  AS median3
FROM xx.yy.zz
),

table3 AS (SELECT * FROM table1
INNER JOIN
(SELECT * FROM table2)
USING(column1, column2)
)

SELECT * FROM table3

Solution

  • Below is for BigQuery Standard SQL

    #standardsql
    create temp function median (arr any type) as (
      if(mod(array_length(arr), 2) = 0,
        ( arr[offset(div(array_length(arr), 2) - 1)] +
          arr[offset(div(array_length(arr), 2))])  / 2,
          arr[offset(div(array_length(arr), 2))] )
    );
    select column1, column2,
      median(array_agg(column3 order by column3)) as median3,
      avg(column3) as mean3
    from `xx.yy.zz`
    group by column1, column2   
    

    if to apply to sample data from your question - output is

    enter image description here