Search code examples
sqlpostgresqlgreatest-n-per-group

Find the most frequent value per group in a table column


I need to find most frequent value of object_of_search for each ethnicity. How can I achieve this? Subqueries in the SELECT clause and correlated subqueries are not allowed. Something similar to this:

mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"

But this does not aggregate and gives me many rows for each ethnicity and object_of_search:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
---------------------------+-------------------+------------------+--------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms
 ethnicity3                |                 2 |              100 | Firearms
 ethnicity1                |                 5 |               60 | Cat
 ethnicity1                |                 5 |               60 | Dog
 ethnicity2                |                 3 | 66.6666666666667 | Firearms
 ethnicity1                |                 5 |               60 | Psychoactive substances
 ethnicity1                |                 5 |               60 | Fireworks

And should be something like this:

 officer_defined_ethnicity | Sas for ethnicity |   Arrest rate    | Most frequent object of search
---------------------------+-------------------+------------------+--------------------------------
 ethnicity2                |                 3 | 66.6666666666667 | Stolen goods
 ethnicity3                |                 2 |              100 | Fireworks
 ethnicity1                |                 5 |               60 | Firearms

Table on fiddle.
Query:

SELECT DISTINCT
    stopAndSearches.officer_defined_ethnicity,
    count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity) AS "Sas for ethnicity",
    sum(case when stopAndSearches.outcome = 'Arrest' then 1 else 0 end)
       OVER (PARTITION BY stopAndSearches.officer_defined_ethnicity)::float /
       count(stopAndSearches.sas_id) OVER(PARTITION BY stopAndSearches.officer_defined_ethnicity)::float * 100 AS "Arrest rate",
    mode() WITHIN GROUP (ORDER BY stopAndSearches.object_of_search) AS "Most frequent object of search"
FROM stopAndSearches
GROUP BY stopAndSearches.sas_id, stopAndSearches.officer_defined_ethnicity;

Table:

CREATE TABLE IF NOT EXISTS stopAndSearches(
    "sas_id" bigserial PRIMARY KEY,
    "officer_defined_ethnicity" VARCHAR(255),
    "object_of_search" VARCHAR(255),
    "outcome" VARCHAR(255)
);

Solution

  • Updated: Fiddle

    This should address the specific "which object per ethnicity" question.

    Note, this doesn't address ties in the count. That wasn't part of the question / request.

    Adjust your SQL to include this logic, to provide that detail:

    WITH cte AS (
            SELECT officer_defined_ethnicity
                 , object_of_search
                 , COUNT(*) AS n
                 , ROW_NUMBER() OVER (PARTITION BY officer_defined_ethnicity ORDER BY COUNT(*) DESC) AS rn
              FROM stopAndSearches
             GROUP BY officer_defined_ethnicity, object_of_search
         )
    SELECT * FROM cte
     WHERE rn = 1
    ;
    

    Result:

    officer_defined_ethnicity object_of_search n rn
    ethnicity1 Cat 1 1
    ethnicity2 Stolen goods 2 1
    ethnicity3 Fireworks 1 1