Search code examples
rdplyrgroup-summaries

Is there a dplyr function to determine the most commonly encountered categorical value within a group?


I am looking to summarize a customer transactional dataframe to a single row per customer using dplyr. For continuous variables this is simple - use sum / mean etc. For categorical variables I would like to choose the "Mode" - i.e. the most commonly encountered value within the group and do this across multiple columns e.g.:

For example to take the table Cus1

Cus <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03"),
             Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH"),
              Store = c("NYC", "NYC", "Chicago", "Chicago", "Detroit", "Detroit", "LA", "San Fran")
              )

And generate the table Cus_Summary:

Cus_Summary <- data.frame(Customer = c("C-01", "C-02", "C-03"),
              Product = c("COKE", "BURGER", "CHICKEN"),
              Store = c("NYC", "Chicago", "LA")
              )

Are there any packages that can provide this function? Or has anyone a function that can be applied across multiple columns within a dplyr step?

I am not worried about smart ways to handle ties - any output for a tie will suffice (although any suggestions as to how to best handle ties would be interesting and appreciated).


Solution

  • How about this?

    Cus %>%
        group_by(Customer) %>%
        summarise(
            Product = first(names(sort(table(Product), decreasing = TRUE))),
            Store = first(names(sort(table(Store), decreasing = TRUE))))
    ## A tibble: 3 x 3
    #  Customer Product Store
    #  <fct>    <chr>   <chr>
    #1 C-01     COKE    NYC
    #2 C-02     BURGER  Chicago
    #3 C-03     CHICKEN LA
    

    Note that in the case of ties this selects the first entry in alphabetical order.


    Update

    To randomly select an entry from tied top frequency entries we could define a custom function

    top_random <- function(x) {
        tbl <- sort(table(x), decreasing = T)
        top <- tbl[tbl == max(tbl)]
        return(sample(names(top), 1))
    }
    

    Then the following randomly selects one of the tied top entries:

    Cus %>%
        group_by(Customer) %>%
        summarise(
            Product = top_random(Product),
            Store = top_random(Store))