I have a data frame like this:
id | w1 | w2 | w3 | w4 | w5 | w6 |
---|---|---|---|---|---|---|
11 | light | light | light | light | light | light |
22 | light | light | light | light | medium | medium |
33 | light | light | medium | medium | medium | heavy |
44 | light | light | medium | NA | NA | NA |
55 | light | light | medium | medium | NA | NA |
66 | medium | medium | medium | NA | NA | NA |
I would like to get the frequency count of light, medium, heavy for each id across w1-w6. And I would to get the mode of w1-w6 as a new column.
The target df should look like this:
id | w1 | w2 | w3 | w4 | w5 | w6 | N_light | N_medium | N_heavy | final |
---|---|---|---|---|---|---|---|---|---|---|
11 | light | light | light | light | light | light | 6 | 0 | 0 | light |
22 | light | light | light | light | medium | medium | 4 | 2 | 0 | light |
33 | light | light | medium | medium | medium | heavy | 2 | 3 | 1 | medium |
44 | light | light | medium | NA | NA | NA | 2 | 1 | 0 | light |
55 | light | light | medium | medium | NA | NA | 2 | 2 | 0 | light |
66 | medium | medium | medium | NA | NA | NA | 0 | 3 | 0 | medium |
The real data frame has millions of rows. I struggle to find an efficient way to do this. Any ideas?
I tried the Mode function from DescTools library, that worked with a limited number of rows in a for loop. But it is too slow to run.
I know this asks for dplyr
, but if other find base R useful you could simply index and use *apply
functions
xx <- unique(unlist(df[-1]))
xx <- xx[!is.na(xx)]
# or xx <- c("light", "medium", "heavy")
newnames <- paste0("N_",xx)
df[newnames] <- sapply(xx,
function(x) rowSums(df[,-1] == x,
na.rm = TRUE))
df["final"] <- xx[apply(df[newnames], 1, which.max)]
Output:
id w1 w2 w3 w4 w5 w6 N_light N_medium N_heavy final
1 11 light light light light light light 6 0 0 light
2 22 light light light light medium medium 4 2 0 light
3 33 light light medium medium medium heavy 2 3 1 medium
4 44 light light medium <NA> <NA> <NA> 2 1 0 light
5 55 light light medium medium <NA> <NA> 2 2 0 light
6 66 medium medium medium <NA> <NA> <NA> 0 3 0 medium