Search code examples
sqlpostgresqlaggregate-functionsgreenplum

Find modal values in each row, with custom tie breaking in PostreSQL


I have table like

create table  vals (k, v, z, y, m) AS VALUES 
(0, 0, 1, 8, 0), 
(1, 2, 1, 1, 6), 
(1, 0, 0, 2, 2),
(5, 4, 6, 8, 9), 
(0, 0, 0, 6, 6);

I want to calculate modal value for each row. If there are multiple modal values, return such value, that occurs first.

k v z y m modal
0 0 1 8 0 0
1 2 1 1 6 1
1 0 0 2 2 0
5 4 6 8 9 5
0 0 0 6 6 0
  • In the 3rd row values 0 and 2 are equally frequent but 0 occurs first so it wins.
  • In the 4th row all values are equally frequent but 5 occurs first so it wins.

How can I do it in PostreSQL?


Solution

  • PostgreSQL has a built-in mode()within group(order by e) function:

    mode () WITHIN GROUP ( ORDER BY anyelement ) → anyelement
    Computes the mode, the most frequent value of the aggregated argument (arbitrarily choosing the first one if there are multiple equally-frequent values). The aggregated argument must be of a sortable type.

    Which matches the Wikipedia definition of a modal value:

    Type Description Example Result
    Arithmetic mean Sum of values of a data set divided by number of values (1+2+2+3+4+7+9) / 7 4
    Median Middle value separating the greater and lesser halves of a data set 1, 2, 2, 3, 4, 7, 9 3
    Mode Most frequent value in a data set 1, 2, 2, 3, 4, 7, 9 2

    The function is an ordered-set aggregate, not a variadic, so you'd have to pretend values in each row are a column. You can pack them into an array[], then unnest() it, then process it and return that, all inside a correlated scalar subquery:
    demo at db-fiddle

    WITH arrs as(select *,array[k, v, z, y, m] as arr 
                 from vals)
    select k, v, z, y, m
          ,(select mode()within group(order by e) 
            from unnest(arr) as e) as modal
    from arrs;
    

    Or try unpivoting, then re-aggregating:

    WITH one_column as(select ctid,k as c from vals
                       union all
                       select ctid,v from vals
                       union all
                       select ctid,z from vals
                       union all
                       select ctid,y from vals
                       union all
                       select ctid,m from vals)
    select min(vals.k)
          ,min(vals.v)
          ,min(vals.z)
          ,min(vals.y)
          ,min(vals.m)
          ,mode()within group(order by c)
    from one_column
    join vals on one_column.ctid=vals.ctid
    group by ctid;
    

    Your updated example breaks ties in a different way than mode() does:

    arbitrarily choosing the first one if there are multiple equally-frequent values

    It looks like what you want is for the one earlier in the unordered list to win.
    To achieve that behaviour, you can just count(*) values in each row (packed into array[], then unnest()ed), sort this by placing most frequent first, and break ties by promoting the one that appeared earlier. unnest()with ordinality adds the information about the original position of the unpacked element.
    demo at db-fiddle

    WITH arrs as(select *,array[k, v, z, y, m] as arr 
                 from vals)
    select k, v, z, y, m
          ,(select e 
            from unnest(arr)with ordinality as _(e,ordinality)
            group by e
            order by count(*) desc
                    ,min(ordinality) asc--earlier in the undordered list wins
                    --,e desc--larger one wins
            limit 1)
    from arrs;
    

    To get all modal value candidates, you can use fetch first 1 rows with ties. This also uses a special array() constructor that accepts a subquery directly:
    demo at db-fiddle

    WITH arrs as(select *,array[k, v, z, y, m] as arr 
                 from vals)
    select k, v, z, y, m
          ,array(select unnest(arr)e
                 group by e
                 order by count(*) desc
                 fetch first 1 rows with ties
           ) as all_modal_value_candidates
    from arrs;
    

    In older versions of PostgreSQL, you can emulate that using rank()over():

    WITH arrs as(select *,array[k, v, z, y, m] as arr 
                 from vals)
    select k, v, z, y, m
          ,(select array_agg(e)
            from(select unnest(arr)e, rank()over(order by count(*) desc)
                 group by e)_
            where rank=1
           ) as all_modal_value_candidates
    from arrs;