Search code examples
rdplyrtop-n

Tidyverse: filtering n largest groups in grouped dataframe


I want to filter the n largest groups based on count, and then do some calculations on the filtered dataframe

Here is some data

Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)

|Brand | Category| Clicks|
|:-----|--------:|------:|
|A     |        1|     10|
|B     |        2|     11|
|C     |        1|     12|
|A     |        1|     13|
|A     |        2|     14|
|B     |        1|     15|
|A     |        2|     14|
|A     |        1|     13|
|B     |        2|     12|
|C     |        1|     11|

This is my expected output. I want to filter out the two largest brands by count and then find the mean clicks in each brand / category combination

|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A     |        1|        12.0|
|A     |        2|        14.0|
|B     |        1|        15.0|
|B     |        2|        11.5|

Which I thought could be achieved with code like this (but can't)

df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>% # Largest 2 brands by count
  summarise(mean_clicks = mean(Clicks))

EDIT: the ideal answer should be able to be used on database tables as well as local tables


Solution

  • Another dplyr solution using a join to filter the data frame:

    library(dplyr)
    
    df %>%
      group_by(Brand) %>%
      summarise(n = n()) %>%
      top_n(2) %>% # select top 2
      left_join(df, by = "Brand") %>% # filters out top 2 Brands
      group_by(Brand, Category) %>%
      summarise(mean_clicks = mean(Clicks))
    
    # # A tibble: 4 x 3
    # # Groups:   Brand [?]
    #   Brand Category mean_clicks
    #   <fct>    <dbl>       <dbl>
    # 1 A            1        12  
    # 2 A            2        14  
    # 3 B            1        15  
    # 4 B            2        11.5