Search code examples
mysqlsortingmedian

Complicated ordering by median in MySQL


Example table:

age | fruit | number_bought |
20  | apple | 3000000       |
20  | apple | 20            |
20  | apple | 60            |
20  | apple | 30            |
20  | apple | 50            |
20  | apple | 4             |
20  | banana| 40            |
30  | grape | 400           |
30  | grape | 450           |
30  | grape | 500           |

Simply a list of how many of a specific fruit a person of a specific age buys while he is that age.

Now, I need to sort this table by "most popular fruits", grouped by age and fruit.

And here's the tricky part, I want to use MEDIAN for calculating popularity, not just the average number. Since some people might be far off the normal (he might be a salesman), like the 3000000 in the example above, while the "average 20 year old" buys alot less, as you can see from the example.

The above table, sorted by median popularity should come out like this:

age | fruit | median |
30  | grape | 450    |
20  | apple | 40     |
20  | banana| 40     |

Now if I simply had used "average" calculation, 20, apple would have won the popularity, simply because of one salesman. So thats why I want to use median.


Solution

  • The common median queries seem to struggle when there are an even number of items (for example, with apple in your test data).

    Simple way round that would be :-

    SELECT y.age, x.fruit, AVG(x.number_bought) AS number_bought
    from data x
    INNER JOIN data y
    ON x.age = y.age
    AND x.fruit = y.fruit
    GROUP BY y.age, x.fruit, x.number_bought
    HAVING SUM(SIGN(1-SIGN(y.number_bought-x.number_bought))) = FLOOR((COUNT(*)+1)/2)
    ORDER BY number_bought DESC;
    

    This isn't strictly accurate as it is just taking the one before the middle one (ie, the median of the 6 records would be the one in position 3.5 - this just uses FLOOR and gets record number 3).

    Possibly slightly more accurate would be this which will get the mean of the 2 records either side when there are an even number

    SELECT age, fruit, AVG(number_bought) AS number_bought
    FROM 
    (
        SELECT y.age, x.fruit, AVG(x.number_bought) AS number_bought
        from data x
        INNER JOIN data y
        ON x.age = y.age
        AND x.fruit = y.fruit
        GROUP BY y.age, x.fruit, x.number_bought
        HAVING SUM(SIGN(1-SIGN(y.number_bought-x.number_bought))) = FLOOR((COUNT(*)+1)/2)
        OR SUM(SIGN(1-SIGN(y.number_bought-x.number_bought))) = CEIL((COUNT(*)+1)/2)
    ) Sub1
    GROUP BY age, fruit
    ORDER BY number_bought DESC;
    

    SQL fiddles here:-

    http://www.sqlfiddle.com/#!2/f1b49/13