I have a sample of cars in my table and I would like to calculate the median price for my sample with SQL. What is the best way to do it?
+-----+-------+----------+
| Car | Price | Quantity |
+-----+-------+----------+
| A | 100 | 2 |
| B | 150 | 4 |
| C | 200 | 8 |
+-----+-------+----------+
I know that I can use percentile_cont (or percentile_disc) if my table is like this:
+-----+-------+
| Car | Price |
+-----+-------+
| A | 100 |
| A | 100 |
| B | 150 |
| B | 150 |
| B | 150 |
| B | 150 |
| C | 200 |
| C | 200 |
| C | 200 |
| C | 200 |
| C | 200 |
| C | 200 |
| C | 200 |
| C | 200 |
+-----+-------+
But in the real world, my first table has about 100 million rows and the second table should have about 3 billiard rows (and moreover I don't know how to transform my first table into the second).
Here is a way to do this in sql server
In the first step i do is calculate the indexes corresponding to the lower and upper bounds for the median (if we have odd number of elements then the lower and upper bounds are same else its based on the x/2 and x/2+1th value)
Then i get the cumulative sum of the quantity and the use that to choose the elements corresponding to the lower and upper bounds as follows
with median_dt
as (
select case when sum(quantity)%2=0 then
sum(quantity)/2
else
sum(quantity)/2 + 1
end as lower_limit
,case when sum(quantity)%2=0 then
(sum(quantity)/2) + 1
else
sum(quantity)/2 + 1
end as upper_limit
from t
)
,data
as (
select *,sum(quantity) over(order by price asc) as cum_sum
from t
)
,rnk_val
as(select *
from (
select price,row_number() over(order by d.cum_sum asc) as rnk
from data d
join median_dt b
on b.lower_limit<=d.cum_sum
)x
where x.rnk=1
union all
select *
from (
select price,row_number() over(order by d.cum_sum asc) as rnk
from data d
join median_dt b
on b.upper_limit<=d.cum_sum
)x
where x.rnk=1
)
select avg(price) as median
from rnk_val
+--------+
| median |
+--------+
| 200 |
+--------+
db fiddle link https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c5cfa645a22aa9c135032eb28f1749f6