
R: get the levels associated with the maximum value in a multidimensional contingency table


With a simple vector like

x <- sample(letters[1:3], size=20, replace=TRUE)

I would extract the most frequent letter with something like

y <- table(x)
print(names(y)[y==max(y)])
"b"

However, the same technique does not work on a multidimensional table built from a data frame:

set.seed(5)
x <- data.frame(c1=sample(letters[1:3], size=30, replace=TRUE),
                c2=sample(letters[4:5], size=30, replace=TRUE),
                c3=sample(letters[6:10], size=30, replace=TRUE))
y <- table(x)

print(names(y)[y==max(y)])
NULL

How can I extract the levels of c1, c2, and c3 that have the highest value in the contingency table?

I know I could convert the table to a data frame and find the row where the Freq column is highest, but given the number of dimensions and levels in my dataset, the converted data frame would not fit in RAM.

Edit: So my expected output in the second case would be c, d, j, as in:

z <- data.frame(y)
z[z$Freq==max(z$Freq), 1:3]
   c1 c2 c3
27  c  d  j

But note that I cannot use the data.frame call on my data due to RAM issues.


Solution

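  • You can use which with arr.ind = TRUE to get the array indices of the matching cells, then look up each index in the corresponding element of dimnames(y):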

    mapply("[", 
           dimnames(y), 
           as.data.frame(which(y == max(y), arr.ind = TRUE)))
    # c1  c2  c3 
    #"c" "d" "j"
    
    mapply("[", 
           dimnames(y), 
           as.data.frame(which(y == min(y), arr.ind = TRUE)))
    #      c1  c2  c3 
    # [1,] "a" "d" "f"
    # [2,] "b" "d" "g"
    # [3,] "c" "d" "g"
    # [4,] "b" "e" "g"
    # [5,] "a" "d" "h"
    # [6,] "b" "d" "h"
    # [7,] "c" "d" "h"
    # [8,] "c" "e" "h"
    # [9,] "a" "e" "i"
    #[10,] "b" "e" "i"
    #[11,] "c" "e" "i"
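
The lookup above can also be written without the as.data.frame step, which makes each stage easier to inspect. The following is a minimal sketch, not the answerer's code: max_levels is a hypothetical helper name, and the small data frame is made-up deterministic data (not the question's random sample) so that the result does not depend on the R version or seed. It relies only on which(..., arr.ind = TRUE), dimnames, which.max and arrayInd from base R.

```r
# Made-up deterministic data: the cell (b, e, g) occurs three times,
# more often than any other combination.
x <- data.frame(c1 = c("a", "a", "b", "b", "b", "c"),
                c2 = c("d", "d", "e", "e", "e", "d"),
                c3 = c("f", "g", "g", "g", "g", "f"))
y <- table(x)

# Hypothetical helper: returns a character matrix with one row per cell
# that holds the maximum count and one column per table dimension.
max_levels <- function(tab) {
  # Integer matrix of array indices, one row per maximal cell
  idx <- which(tab == max(tab), arr.ind = TRUE)
  dn <- dimnames(tab)
  out <- matrix(NA_character_, nrow = nrow(idx), ncol = length(dn),
                dimnames = list(NULL, names(dn)))
  # Translate each column of indices into the level names of that dimension
  for (k in seq_along(dn)) {
    out[, k] <- dn[[k]][idx[, k]]
  }
  out
}

res <- max_levels(y)
print(res)

# If only the first maximal cell is needed, which.max gives its linear
# index and arrayInd converts that into one index per dimension.
first <- arrayInd(which.max(y), dim(y))
print(first)
```

Unlike the mapply version, this sketch always returns a matrix, even when only one cell attains the maximum, so downstream code does not have to handle two different result shapes.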