I have the following dataset which I'm trying to trim (sample of the dataframe is below)
Index | Grade |
---|---|
Ace_1_1 | A |
Ace_1_1 | A |
Ace_1_1 | B |
Ace_1_1 | C |
Ace_1_2 | A |
Ace_1_2 | C |
Ace_1_2 | C |
Ace_1_3 | B |
Ace_1_3 | B |
Ace_2_2 | C |
Ace_2_2 | A |
Ace_2_2 | B |
Ace_2_5 | C |
Ace_2_5 | C |
Ace_2_5 | A |
Ace_2_5 | A |
Ace_3_1 | D |
df2 = structure(list(Index = c("Ace_1_1", "Ace_1_1", "Ace_1_1", "Ace_1_1",
"Ace_1_2", "Ace_1_2", "Ace_1_2", "Ace_1_3", "Ace_1_3", "Ace_2_2",
"Ace_2_2", "Ace_2_2", "Ace_2_5", "Ace_2_5", "Ace_2_5", "Ace_2_5",
"Ace_3_1"), Grade = c("A", "A", "B", "C", "A", "C", "C", "B",
"B", "C", "A", "B", "C", "C", "A", "A", "D")), class = "data.frame", row.names = c(NA,-17L))
I'm trying to trim it down such that for each index, only the most common grade would show. If there is a tie in frequency, the tied grades would be shown. If there's only one entry the entry would be show as well. The ideal output for the table above would be
Index | Grade |
---|---|
Ace_1_1 | A |
Ace_1_2 | C |
Ace_1_3 | B |
Ace_2_2 | C |
Ace_2_2 | A |
Ace_2_2 | B |
Ace_2_5 | C |
Ace_2_5 | A |
Ace_3_1 | D |
I'm using the group_by() function by counting the frequency with mutate (Frequency = n()) but I'm having trouble subsetting the most frequent count. Appreciate if anyone could help me count here either using dplyr or even base R, thanks!
Using data.table or dplyr, you can simply take the rows in each group where the count equals the max count:
library(dplyr)
count(df2,Index,Grade) %>% filter(n==max(n), .by=Index) %>% select(-n)
OR
library(data.table)
setDT(df2)[, .N, .(Index, Grade)][, .SD[N==max(N), Grade], Index]
Output:
Index Grade
<char> <char>
1: Ace_1_1 A
2: Ace_1_2 C
3: Ace_1_3 B
4: Ace_2_2 A
5: Ace_2_2 B
6: Ace_2_2 C
7: Ace_2_5 A
8: Ace_2_5 C
9: Ace_3_1 D