
How to collapse factors into "other" (not based on size)


I'm working with NHL player data and want to compare one selected player's points to those of the rest of the population. The player data looks like this:

 Player Season Team  Position    GP   TOI     G     A     P    P1 `P/60`
 <chr>   <int> <chr> <chr>    <int> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 Aaron~   2019 FLA   D           35 603.      3     2     5     3   0.5 
 Adam ~   2019 CBJ   D            4  35.5     0     0     0     0   0  
 Adam ~   2019 T.B   L           23 218.      2     7     9     5   2.48

and so on for the rest of the league. I'd like to compare a summary statistic for one player against the same statistic for the rest of the data set, something like this:

 Player    Season Team  Position  Summary Statistic
 <chr>      <int> <chr> <chr>                 <int>
 Kasperi     2019 FLA   D                        45
 "Others"    2019 CBJ   D                        53

I've seen fct_lump used to keep the top records, sorted by some count, but when I tried something similar using the player names I couldn't get it to work.

NHL %>% 
 mutate(Player = fct_lump(Player,
                              Kasperi Kapanen = "Kasperi Kapanen",
                              other = !("Kasperi Kapanen")))

Solution

  • fct_lump lumps levels by how often they occur, so it does not give you the flexibility to pick levels by name. To set one player against all other observations, use dplyr's if_else:

    library(dplyr)
    NHL %>% 
        mutate(Player = if_else(Player == "Kasperi Kapanen", "Kasperi Kapanen",
                                                             "others"))
    

    Or use case_when when you need several comparisons:

    NHL %>% 
        mutate(Player = case_when(
                           Player == "Kasperi Kapanen" ~ "Kasperi Kapanen", 
                           Player == "Adam" ~ "Adam",
                           TRUE ~ "others" 
                                 ))
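
To show how the recoded column feeds into the comparison the question asks for, here is a minimal sketch on toy data. The player names other than "Kasperi Kapanen", the point values, and the column name mean_P are made up for illustration and are not taken from the real data set. The second part uses forcats' fct_other, which keeps the levels you name and collapses everything else, as one possible alternative to if_else.

```r
library(dplyr)
library(forcats)

# Toy data: hypothetical players and point totals, for illustration only
NHL <- tibble(
  Player = c("Kasperi Kapanen", "Player A", "Player B", "Player C"),
  P      = c(10, 5, 9, 4)
)

# Collapse every player except the selected one into "others",
# then compute a summary statistic for each of the two groups
by_group <- NHL %>%
  mutate(Player = if_else(Player == "Kasperi Kapanen", Player, "others")) %>%
  group_by(Player) %>%
  summarise(mean_P = mean(P), n = n())

# Alternative: fct_other() keeps the named level(s) and lumps the rest
lumped <- NHL %>%
  mutate(Player = fct_other(Player,
                            keep = "Kasperi Kapanen",
                            other_level = "others"))

print(by_group)
print(levels(lumped$Player))
```

Either version leaves a Player column with two values, so the same group_by and summarise step works on both. The keep argument of fct_other accepts a vector of names if more than one player should stay separate.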