Search code examples
sqlpostgresqlsubqueryaggregate-functionscommon-table-expression

Select most frequent value between 2 values per id with conditions


Is there any 'better code' than what I've done with cte ?

I want to select the most frequent value between M and F per id with the following rules=

  • sex values different than ilike 'F' or 'M', not considered in frequency calc
  • if most frequent calc not successful = 'inconclusive'
  • if most frequent calc successful = sex value

here is an example:

dataset

id sex
1 M
1 F
2 M
2 M
2 x
2 F
3 F
3 0
4 f
4
5 d

result expected

id sex
1 inconclusive
2 M
3 F
4 F

what I've done :

WITH cte1 AS (
                SELECT 
             id,
            sex,
            RANK() OVER (PARTITION BY id ORDER BY count(*)) rn
        FROM dataset
        WHERE sex ~* '(F|M)' AND sex IS NOT NULL
        GROUP BY id, sex
        ),
    cte2 AS (
         SELECT id,
                       max(rn) AS max
        FROM cte1
        GROUP BY id
        ),
    cte3 AS (
        SELECT cte2.id,
                       sex
        FROM cte2
        LEFT JOIN cte1 ON cte2.id=cte1.id AND max=rn
        WHERE cte1.id IS NOT NULL 
        ),
    cte4 AS (   
        SELECT id,
                      count(*) as cnt
        FROM cte3
        GROUP BY id
        )
SELECT DISTINCT cte4.id,
               CASE 
               WHEN cnt>1 THEN 'inconclusive'
               WHEN cnt=1 AND SEX IN ('F', 'M') THEN sex
               END AS sex
FROM cte4
LEFT JOIN cte3 ON cte4.id=cte3.id

To me the code is efficient in a sense that it gives the appropriate result but it looks a bit bulky, I'm looking for improvements. Is there any ?

note: I've used DISTINCT ON () but it can not retrieve id 1 = inconclusive (F or M depending the order)


Solution

  • You seem to be overcomplicating this.

    I would start by filtering out values other than M or F, then aggregate by id, and count how many times each value appears:

    select id, 
        count(*) filter(where sex = 'M') cnt_m,
        count(*) filter(where sex = 'F') cnt_f
    from dataset
    where sex in ('M', 'F')
    group by id
    

    I don't see the need for a regular expression match in the where clause, since it seems you just want to retain 'F' and 'M' values.

    From there on, all we have to do is compare the counts. We can do it in an outer query so we don't need to repeat the conditional expressions:

    select id, 
        case when cnt_m > cnt_f then 'M'
             when cnt_m < cnt_f then 'F'
             else 'inconclusive'
        end as res
    from (
        select id, 
            count(*) filter(where sex = 'M') cnt_m,
            count(*) filter(where sex = 'F') cnt_f
        from dataset
        where sex in ('M', 'F')
        group by id
    ) t