Search code examples
sqlsql-serversql-server-2008recursionmedian

SQL: Calculate the median for each date range


I am using SQL Server 2008 and am trying to calculate median values per date range.

Example: There are 4 dates with multiple values at each date, (1/1/16, 3/1/16, 7/1/16, 10/1/16), the median value for date 10/1/16 would be calculated from the values in the date range 1/1/16 – 7/1/16. The median value for date 7/1/16 would be calculated from the values in the date range 1/1/16 – 3/1/16.

If the date is 10/1/16, 10/1/16 values should not be included in the median (This true for all dates. Also, future dates should not be included in the calculation).

The query below calculates the median value for the MAX Visit Date. However, I need it to calculate the median for the other 3 Visit Dates as well. I have tried removing the MAX CTE and adding a join for all visit dates < than visit date but I could not get it to work. I have been unsuccessful in writing this so far, so any help would be great. I have included sample data and my expected results below.

EDIT: Would some kind of recursion possibly work?

;CREATE TABLE #TEST(QUESTION VARCHAR(15), VISIT_DATE DATE, VALUE INT)

;INSERT #TEST(QUESTION, VISIT_DATE, VALUE)
VALUES
('ABC', '1/1/2016', '80'),
('ABC', '1/1/2016', '90'),
('ABC', '1/1/2016', '100'),
('ABC', '3/1/2016', '70'),
('ABC', '3/1/2016', '80'),
('ABC', '3/1/2016', '90'),
('ABC', '3/1/2016', '100'),
('ABC', '7/1/2016', '50'),
('ABC', '7/1/2016', '60'),
('ABC', '7/1/2016', '70'),
('ABC', '10/1/2016', '10'),
('ABC', '10/1/2016', '20'),
('ABC', '10/1/2016', '30'),
('ABC', '10/1/2016', '40')


;WITH MAX_VISITDATE AS (
    SELECT MAX(VISIT_DATE) AS MAX_VISITDATE
    FROM #TEST
), MEDIAN AS (
    SELECT RN.Question, AVG(RN.VALUE) AS GroupMedianPastQtrs
    FROM
    (   SELECT QUESTION, VALUE, ROW_NUMBER() OVER (PARTITION BY QUESTION ORDER BY VALUE) AS ROWNUMBER, COUNT(*) OVER (PARTITION BY Question) AS QuestionCount
        FROM #TEST T 
        WHERE VISIT_DATE NOT IN (SELECT MAX_VISITDATE FROM MAX_VISITDATE)
    ) RN
    WHERE RN.ROWNUMBER IN (RN.QuestionCount/2+1, (RN.QuestionCount+1)/2)
    GROUP BY RN.Question
)
SELECT *
FROM #TEST T
INNER JOIN MEDIAN ON T.Question = MEDIAN.Question

--Expected Results:

Question|Visit_DAte |Value|GroupMedian  |
--------|-----------|-----|-------------|
'ABC'   |'1/1/2016' |'80' |''           |--No Median, no previous values
'ABC'   |'1/1/2016' |'90' |''           |--No Median, no previous values
'ABC'   |'1/1/2016' |'100'|''           |--No Median, no previous values
'ABC'   |'3/1/2016' |'70' |'90'         |--Median value from date 1/1/16
'ABC'   |'3/1/2016' |'80' |'90'         |--Median value from date 1/1/16
'ABC'   |'3/1/2016' |'90' |'90'         |--Median value from date 1/1/16
'ABC'   |'3/1/2016' |'100'|'90'         |--Median value from date 1/1/16
'ABC'   |'7/1/2016' |'50' |'90'         |--Median value from date range 1/1/16 to 3/1/16
'ABC'   |'7/1/2016' |'60' |'90'         |--Median value from date range 1/1/16 to 3/1/16
'ABC'   |'7/1/2016' |'70' |'90'         |--Median value from date range 1/1/16 to 3/1/16
'ABC'   |'10/1/2016'|'10' |'80'         |--Median value from date range 1/1/16 to 7/1/16
'ABC'   |'10/1/2016'|'20' |'80'         |--Median value from date range 1/1/16 to 7/1/16
'ABC'   |'10/1/2016'|'30' |'80'         |--Median value from date range 1/1/16 to 7/1/16
'ABC'   |'10/1/2016'|'40' |'80'         |--Median value from date range 1/1/16 to 7/1/16

Solution

  • I don't have a SQL Server 2008 box to test this. So I tried my best to cross-check every function below is available in 2008:

    ;WITH
        tmp AS
        (
            SELECT          a.QUESTION
                        ,   a.VISIT_DATE
                        ,   b.VALUE
                        ,   ROW_NUMBER() OVER (PARTITION BY a.QUESTION, a.VISIT_DATE ORDER BY b.VALUE)
                                                                    AS RowNumber
                        ,   FLOOR(CONVERT(float, COUNT(b.Value) OVER (PARTITION BY a.QUESTION, a.VISIT_DATE) + 1) / 2)
                                                                    AS LowerMedianRowNumber
                        ,   CEILING(CONVERT(float, COUNT(b.Value) OVER (PARTITION BY a.QUESTION, a.VISIT_DATE) + 1) / 2)
                                                                    AS UpperMedianRowNumber
            FROM            (
                                SELECT DISTINCT
                                            QUESTION
                                        ,   VISIT_DATE
                                FROM        #TEST
                            )       a
            INNER JOIN      #TEST   b   ON a.QUESTION   = b.QUESTION
                                       AND a.VISIT_DATE > b.VISIT_DATE
    
        ),
        GroupMedian AS
        (
            SELECT          QUESTION
                        ,   VISIT_DATE
                        ,   AVG(Value)      AS MedianValue
            FROM            tmp
            WHERE           RowNumber IN (LowerMedianRowNumber, UpperMedianRowNumber)
            GROUP BY        QUESTION
                        ,   VISIT_DATE
        )
    
    SELECT          a.*
            ,       b.MedianValue
    FROM            #TEST           a
    LEFT  JOIN      GroupMedian     b   ON a.QUESTION   = b.QUESTION
                                       AND a.VISIT_DATE = b.VISIT_DATE
    ORDER BY        QUESTION
            ,       VISIT_DATE