Search code examples
prestomaxby

Presto: is MAX_BY() deterministic


Is the function MAX_BY() deterministic. If I use MAX_() for two different columns both depending on a third one, will I get the same row result?

The presto documentation doesn't mention this. This documentation about mysql mention that it is not, so I'm not sure where to find this info.

I tested quickly with the following:

WITH my_table(id, arr, something) AS (
    VALUES
        (1, ARRAY['one'], 0.0),
        (2, ARRAY['two'], 0.0),
        (3, ARRAY['three'], 0.0),
        (4, ARRAY['four'], 0.0),
        (5, ARRAY['five'], 0.0),
        (6, ARRAY[''], 0.0)
)
SELECT
    MAX_BY(id,something),
    MAX_BY(arr,something)
FROM my_table

It returned the first row, so it doesn't feel arbitrary but also does not prove things.

Anyone out there able to help?

There is a related question to return multiple columns from a single MAX_BY() so I'm thinking that I need to use that solution to guarantee the attribute of the same row is selected: max_by with multiple return columns


Solution

  • No, in the case of ties, the result of max_by and min_by is arbitrary. It may appear to be deterministic, but that's not defined behavior and may change at some point.

    If you want all the values to be consistent, you have to use the trick you referred to, where you pack all the columns of interest in a single value of type ROW:

    SELECT max_by((x1, x2, x3), y) r
    FROM (...) t(y, x1, x2, x3)