Search code examples
sqlwindow-functionsmediangreenplum

Rolling (moving) median in Greenplum


I would like to calculate the rolling median for a column in Greenplum, i.e. as below:

|  x | rolling_median_x |
| -- + ---------------- |
|  4 |                4 |
|  1 |              2.5 |
|  3 |                3 |
|  2 |              2.5 |
|  1 |                2 |
|  6 |              2.5 |
|  9 |                3 |

x is an integer and for each row rolling_median_x shows the median of x for the current and preceding rows. E.g. for the third row rolling_median_x = median(4, 1, 3) = 3.

Things I've found out so far:

  • the median function can't be used as a framed window function, i.e. median(x) OVER(RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  • the same is true for many other function such as percent_rank or nth_value
  • recursive self join is not supported in this version of Greenplum

As a matter of fact I was unable to find proper documentation on which functions can be used as framed window function in Greenplum...

I'm using Greenplum 4.3.4.0 (which is based on Postgres 8.2.15) and updating is not an option unfortunately.


Solution

  • One remark - a citate from Wikipedia: ORDER BY

    ORDER BY is the only way to sort the rows in the result set. Without this clause, the relational database system may return the rows in any order. If an ordering is required, the ORDER BY must be provided in the SELECT statement sent by the application. Although some database systems allow the specification of an ORDER BY clause in subselects or view definitions, the presence there has no effect. A view is a logical relational table, and the relational model mandates that a table is a set of rows, implying no sort order whatsoever.


    Since you need to calculate the median for the current and preceding rows, you must have in the table an additional row that defines the order of rows and can be used to determine which rows precede given row and which ones come after.
    Let say some id column like this:

    | id | x | rolling_median_x |
    |----|---|------------------|
    |  1 | 4 |                4 |
    |  2 | 1 |              2.5 |
    |  3 | 3 |                3 |
    |  4 | 2 |              2.5 |
    |  5 | 1 |                2 |
    |  6 | 6 |              2.5 |
    |  7 | 9 |                3 |
    

    If you cannot use analytic functions, then try pure SQL.
    This article shows various methods of computing the Median with SQL.
    I think the Henderson’s Median would be best for our needs:

    SELECT CASE COUNT(*) % 2
           WHEN 0        -- even sized table
           THEN (P1.part_wgt + MIN(CASE WHEN P2.part_wgt > P1.part_wgt
                                      THEN P2.part_wgt
                                      ELSE NULL END))/2.0
           ELSE P1.part_wgt --odd sized table
           END AS median 
      FROM Parts AS P1, Parts AS P2
     GROUP BY P1.part_wgt
    HAVING COUNT(CASE WHEN P1.part_wgt >= P2.part_wgt
                      THEN 1
                      ELSE NULL END)
           = (COUNT(*) + 1) / 2;
    

    Just run this query for each row as a dependent subquery, a general idea is like this:

    SELECT t.*, (
            SELECT .... Henderson's query FROM table x
            WHERE x.id <= t.id
            ......
           ) As our_median
    FROM table t
    

    You can find an example implementation in this demo

    SELECT t.*, (
        SELECT CASE COUNT(*) % 2
               WHEN 0        -- even sized table
               THEN (P1.x + MIN(CASE WHEN P2.x > P1.x
                                          THEN P2.x
                                          ELSE NULL END))/2.0
               ELSE P1.x --odd sized table
               END AS median 
          FROM Table333 AS P1, Table333 AS P2
          WHERE p1.id <= t.id AND p2.id <= t.id
         GROUP BY P1.x
        HAVING COUNT(CASE WHEN P1.x >= P2.x
                          THEN 1
                          ELSE NULL END)
               = (COUNT(*) + 1) / 2
        ) as Our_median
    FROM Table333 t;
    
    | id | x | rolling_median_x | our_median |
    |----|---|------------------|------------|
    |  1 | 4 |                4 |          4 |
    |  2 | 1 |              2.5 |        2.5 |
    |  3 | 3 |                3 |          3 |
    |  4 | 2 |              2.5 |        2.5 |
    |  5 | 1 |                2 |          2 |
    |  6 | 6 |              2.5 |        2.5 |
    |  7 | 9 |                3 |          3 |
    

    This query will probably be slow - this is a price you must pay for having ancient version of PostgreSQL