Search code examples
sqlsqlitegroup-byrankingmedian

SQL ranking query to compute ranks and median in sub groups


I want to compute the Median of y in sub groups of this simple xy_table:

  x | y --groups--> gid |   x | y --medians-->  gid |   x | y
-------             -------------               -------------
0.1 | 4             0.0 | 0.1 | 4               0.0 | 0.1 | 4
0.2 | 3             0.0 | 0.2 | 3                   |     |
0.7 | 5             1.0 | 0.7 | 5               1.0 | 0.7 | 5
1.5 | 1             2.0 | 1.5 | 1                   |     |
1.9 | 6             2.0 | 1.9 | 6                   |     |
2.1 | 5             2.0 | 2.1 | 5               2.0 | 2.1 | 5
2.7 | 1             3.0 | 2.7 | 1               3.0 | 2.7 | 1

In this example every x is unique and the table is already sorted by x. I now want to GROUP BY round(x) and get the tuple that holds the median of y in each group.

I can already compute the median for the whole table with this ranking query:

SELECT a.x, a.y FROM xy_table a,xy_table b
WHERE a.y >= b.y
GROUP BY a.x, a.y
HAVING count(*) = (SELECT round((count(*)+1)/2) FROM xy_table)

Output: 0.1, 4.0

But I did not yet succeed writing a query to compute the median for sub groups.

Attention: I do not have a median() aggregation function available. Please also do not propose solutions with special PARTITION, RANK, or QUANTILE statements (as found in similar but too vendor specific SO questions). I need plain SQL (i.e., compatible to SQLite without median() function)

Edit: I was actually looking for the Medoid and not the Median.


Solution

  • I suggest doing the computing in your programming language:

    for each group:
      for each record_in_group:
        append y to array
      median of array
    

    But if you are stuck with SQLite, you can order each group by y and select the records in the middle like this http://sqlfiddle.com/#!5/d4c68/55/0:

    UPDATE: only bigger "median" value is importand for even nr. of rows, so no avg() is needed:

    select groups.gid,
      ids.y median
    from (
      -- get middle row number in each group (bigger number if even nr. of rows)
      -- note the integer divisions and modulo operator
      select round(x) gid,
        count(*) / 2 + 1 mid_row_right
      from xy_table
      group by round(x)
    ) groups
    join (
      -- for each record get equivalent of
      -- row_number() over(partition by gid order by y)
      select round(a.x) gid,
        a.x,
        a.y,
        count(*) rownr_by_y
      from xy_table a
      left join xy_table b
        on round(a.x) = round (b.x)
        and a.y >= b.y
      group by a.x
    ) ids on ids.gid = groups.gid
    where ids.rownr_by_y = groups.mid_row_right