Search code examples
sqlgoogle-bigquery

Google SQL - Return the most frequent string dependent on 2 columns


I need to find the most frequent text value in a table but I am having a difficult time wrapping my head around applying the solutions I have found to my dataset.

My table is structured as follows:

Column A Column B
Val A Val a
Val A Val b
Val A Val a
Val B Val a
Val B Val b
Val B Val b

I have created gotten the basic concept of calculating mode using count and max functions with CTEs but I really can't figure out how to return the table I need.

What I am expecting is a table like:

Column A Column B
Val A Val a
Val B Val b

Val a is the most frequent for Val A and b for B.

There are hundreds of unique values in Col A and I'm trying to solve this so that I can then return the 2nd most frequent value in Col B if the most frequent is null.

Thanks for any guidance!

WITH counts AS(
  SELECT
  ColA,
  COUNT(ColB) AS ColB_count
  FROM table
  GROUP BY ColA
)

SELECT ColA, ColB_count
FROM counts
WHERE ColB_count = (
  SELECT MAX(ColB_count)
  FROM counts
);

Solution

  • I'd say this is what you are trying to do with your attempted sql. Loosely speaking, you need the counts, and then you need the max counts, and then you need to put these two back together to match the max count with the value (from ColB) that it goes with. But be sure to test with ties and with nulls (if applicable) to ensure you understand how these cases will behave.

    with cte1 as (
        select 
            ColA,
            ColB,
            count(ColB) as CountOfColB
        from Table
        group by
            ColA,
            ColB
    ),
    cte2 as (
        select
            ColA,
            Max(CountOfColB) as total
        from cte1
        group by ColA
    )
    select 
        cte1.ColA,
        cte1.ColB
        --,cte1.CountOfColB
    from 
        cte1
        inner join cte2
        on cte1.ColA = cte2.ColA
        and cte1.CountOfColB = cte2.total
    order by 
        cte1.ColA,
        cte1.ColB
    

    This is another approach using a window function with rank in the second cte (again, check your results when/if you have ties or nulls to be sure you are handling these as desired):

    with cte1 as (
        select 
            ColA,
            ColB,
            count(ColB) as CountOfColB
        from Table
        group by
            ColA,
            ColB
    ),
    cte2 as (
        select
            ColA,
            ColB,
            rank() over (partition by ColA order by CountOfColB) as RankedValue
        from cte1
    )
    select 
        cte2.ColA,
        cte2.ColB
    from 
        cte2
    where cte2.RankedValue = 1
    order by 
        cte2.ColA,
        cte2.ColB