Search code examples
sqldb2aggregate-functionsmedian

DB2 SQL - median with GROUP BY


First of all, I am running on DB2 for i5/OS V5R4. I have ROW_NUMBER(), RANK() and common table expressions. I do not have TOP n PERCENT or LIMIT OFFSET.

The actual data set I'm working with is hard to explain, so let's just say I have a weather history table where the columns are (city, temperature, timestamp). I want to compare medians to averages for each group (city).

This was the cleanest way I found to get a median for a whole table aggregation. I adapted it from the IBM Redbook here:

WITH base_t AS
( SELECT temp, row_number() over (order by temperature) AS rownum FROM t ),
count_t AS
( SELECT COUNT(temperature) + 1 AS base_count FROM base_t ),
median_t AS
( SELECT temperature FROM base_t, count_t
  WHERE rownum in (FLOOR(base_count/2e0), CEILING(base_count/2e0)) )
SELECT DECIMAL(AVG(temperature),10,2) AS median FROM median_t

That works well for getting a single row back, but it seems to fall apart for grouping. Conceptually, this is what I want:


SELECT city, AVG(temperature), MEDIAN(temperature) FROM ...

city           | mean_temp       | median_temp       
===================================================
'Minneapolis'  | 60              | 64
'Milwaukee'    | 65              | 66
'Muskegon'     | 70              | 61

There could be an answer that makes me look stupid, but I'm having a mental block and this isn't my #1 thing to work on right now. Seems like it could be possible, but I can't use something that's extremely complex since it's a large table and I want the ability to customize which columns are being aggregated.


Solution

  • In SQL Server, agreagate functions like count(*) can be partitioned and calculated without a group by. I looked quickly through the referenced redbook, and it looks like DB2 has the same feature. But if not, then this won't work:

    create table TemperatureHistory 
        (City varchar(20)
        , Temperature decimal(5, 2)
        , DateTaken datetime)
    
    insert into TemperatureHistory values ('Minneapolis', 61, '20090101')
    insert into TemperatureHistory values ('Minneapolis', 59, '20090102')
    
    insert into TemperatureHistory values ('Milwaukee', 65, '20090101')
    insert into TemperatureHistory values ('Milwaukee', 65, '20090102')
    insert into TemperatureHistory values ('Milwaukee', 100, '20090103')
    
    insert into TemperatureHistory values ('Muskegon', 80, '20090101')
    insert into TemperatureHistory values ('Muskegon', 70, '20090102')
    insert into TemperatureHistory values ('Muskegon', 70, '20090103')
    insert into TemperatureHistory values ('Muskegon', 20, '20090104')
    
    ; with base_t as
        (select city
            , Temperature
            , row_number() over (partition by city order by temperature) as RowNum
            , (count(*) over (partition by city)) + 1 as CountPlusOne 
        from TemperatureHistory)
    select City
        , avg(Temperature) as MeanTemp
        , avg(case 
            when RowNum in (FLOOR(CountPlusOne/2.0), CEILING(CountPlusOne/2.0)) 
                then Temperature
                else null end) as MedianTemp
    from base_t 
    group by City