Search code examples
rcountrecode

Get the most frequent value per row and account for ties


Sample data:

df <- data.frame("ID" = 1:6, 
                 "Group1" = c("A", NA, "C", NA, "E", "C"), 
                 "Group2" = c("E", "C", "C", NA, "E", "E"),
                 "Group3" = c("A", "A", NA, NA, "C", NA),
                 "Group4" = c(NA, "C", NA, "D", "C", NA),
                 "Group5" = c("A", "D", NA, NA, NA, NA))

In each row, I want to count the number of each value and store the most frequent value in a new variable, New.Group. In case of ties, the first value in the row should be selected. The logic applied to the example:

Row 1 of New.Group takes value A because it is most frequent value in the row, ignoring NAs.

Row 2 takes value C because it is also the most frequent value.

Row 3 the same as Row 2.

Row 4 takes value D because it's the only value in the row.

In Row 5 both E and C has count 2, but E is selected because it is encountered before C in the row.

Row 6, similar to row 5, both C and E has count 1, but C is selected because it is encountered before E in the row.

The desired output:

  ID Group1 Group2 Group3 Group4 Group5 New.Group
1  1      A      E      A   <NA>      A         A
2  2   <NA>      C      A      C      D         C
3  3      C      C   <NA>   <NA>   <NA>         C
4  4   <NA>   <NA>   <NA>      D   <NA>         D
5  5      E      E      C      C   <NA>         E
6  6      C      E   <NA>   <NA>   <NA>         C

Solution

  • I think this achieves what you're looking for. For each row, it creates a table of frequencies of each letter and chooses the largest, whilst preserving column order for ties. It then returns the name of the first column in this table.

    Thanks to Henrik for suggesting the improvement.

    df$New.Group <- apply(df[-1], 1, function(x) {
    names(which.max(table(factor(x, unique(x)))))
    })
    
    df
    #>   ID Group1 Group2 Group3 Group4 Group5 New.Group
    #> 1  1      A      E      A   <NA>      A         A
    #> 2  2   <NA>      C      A      C      D         C
    #> 3  3      C      C   <NA>   <NA>   <NA>         C
    #> 4  4   <NA>   <NA>   <NA>      D   <NA>         D
    #> 5  5      E      E      C      C   <NA>         E
    #> 6  6      C      E   <NA>   <NA>   <NA>         C