Search code examples
sqlsql-servermedian

SQL how to calculate median not based on rows


I have a sample of cars in my table and I would like to calculate the median price for my sample with SQL. What is the best way to do it?

+-----+-------+----------+
| Car | Price | Quantity |
+-----+-------+----------+
| A   |   100 |        2 |
| B   |   150 |        4 |
| C   |   200 |        8 |
+-----+-------+----------+

I know that I can use percentile_cont (or percentile_disc) if my table is like this:

+-----+-------+
| Car | Price |
+-----+-------+
| A   |   100 |
| A   |   100 |
| B   |   150 |
| B   |   150 |
| B   |   150 |
| B   |   150 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
+-----+-------+

But in the real world, my first table has about 100 million rows and the second table should have about 3 billiard rows (and moreover I don't know how to transform my first table into the second).


Solution

  • Here is a way to do this in sql server

    In the first step i do is calculate the indexes corresponding to the lower and upper bounds for the median (if we have odd number of elements then the lower and upper bounds are same else its based on the x/2 and x/2+1th value)

    Then i get the cumulative sum of the quantity and the use that to choose the elements corresponding to the lower and upper bounds as follows

    with median_dt
      as (
    select case when sum(quantity)%2=0 then
                     sum(quantity)/2 
                else 
                     sum(quantity)/2 + 1
            end as lower_limit
          ,case when sum(quantity)%2=0 then
                     (sum(quantity)/2) + 1
                else 
                     sum(quantity)/2 + 1
            end as upper_limit  
     from t 
         )
        ,data
        as (
     select *,sum(quantity) over(order by price asc) as cum_sum
       from t
           )
       ,rnk_val
       as(select * 
           from (
                 select price,row_number() over(order by d.cum_sum asc) as rnk
                   from data d
                   join median_dt b
                     on b.lower_limit<=d.cum_sum
                 )x 
          where x.rnk=1
          union all 
         select * 
           from (
                 select price,row_number() over(order by d.cum_sum asc) as rnk
                   from data d
                   join median_dt b
                     on b.upper_limit<=d.cum_sum
                 )x 
          where x.rnk=1
          ) 
      select avg(price) as median
        from rnk_val
    
    
    
    +--------+
    | median |
    +--------+
    |    200 |
    +--------+
    

    db fiddle link https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c5cfa645a22aa9c135032eb28f1749f6