Search code examples
rdplyrmode

How to create a new column based on the mode of some other columns in a data frame in R


I have a data frame like this:

id w1 w2 w3 w4 w5 w6
11 light light light light light light
22 light light light light medium medium
33 light light medium medium medium heavy
44 light light medium NA NA NA
55 light light medium medium NA NA
66 medium medium medium NA NA NA

I would like to get the frequency count of light, medium, heavy for each id across w1-w6. And I would to get the mode of w1-w6 as a new column.

The target df should look like this:

id w1 w2 w3 w4 w5 w6 N_light N_medium N_heavy final
11 light light light light light light 6 0 0 light
22 light light light light medium medium 4 2 0 light
33 light light medium medium medium heavy 2 3 1 medium
44 light light medium NA NA NA 2 1 0 light
55 light light medium medium NA NA 2 2 0 light
66 medium medium medium NA NA NA 0 3 0 medium

The real data frame has millions of rows. I struggle to find an efficient way to do this. Any ideas?

I tried the Mode function from DescTools library, that worked with a limited number of rows in a for loop. But it is too slow to run.


Solution

  • I know this asks for dplyr, but if other find base R useful you could simply index and use *apply functions

    xx <- unique(unlist(df[-1]))
    xx <- xx[!is.na(xx)]
     # or xx <- c("light", "medium", "heavy")
    newnames <- paste0("N_",xx)
    
    df[newnames] <- sapply(xx, 
                           function(x) rowSums(df[,-1] == x, 
                                               na.rm = TRUE))
    df["final"] <- xx[apply(df[newnames], 1, which.max)]
    

    Output:

      id     w1     w2     w3     w4     w5     w6 N_light N_medium N_heavy  final
    1 11  light  light  light  light  light  light       6        0       0  light
    2 22  light  light  light  light medium medium       4        2       0  light
    3 33  light  light medium medium medium  heavy       2        3       1 medium
    4 44  light  light medium   <NA>   <NA>   <NA>       2        1       0  light
    5 55  light  light medium medium   <NA>   <NA>       2        2       0  light
    6 66 medium medium medium   <NA>   <NA>   <NA>       0        3       0 medium