Counting pairs of column elements with the same value in a data frame and showing the result in matrix format


I searched the internet for a similar solution, but I could not find one for my specific case. Let's say I have the following data frame:

a = c(1, 1, 1, 2, 2)
b = c(2, 1, 1, 1, 2)
c = c(2, 2, 1, 1, 1)
d = c(1, 2, 2, 1, 1)
df <- data.frame(a = a, b = b, c = c, d = d)

and df looks like this:

  a b c d
1 1 2 2 1
2 1 1 2 2
3 1 1 1 2
4 2 1 1 1
5 2 2 1 1

Note: In this example I use the pair of values [1,2], but it could be a different set of values, such as [-1,1], or more than two possible values, such as [-1,1,2].

Now I would like to have a matrix where each [i,j] element represents the number of rows with the value 1 in both column i and column j. For this particular case we have (showing only the upper triangle, because the matrix is symmetric):

  a b c d
a 3 2 1 1
b   3 2 1
c     3 2
d       3

The diagonal should count the number of rows with the value 1 in a given column. In this case all columns have the same number of 1 values. The format should be similar to the output of the cor() function (correlation matrix).

I tried using table() (and also crosstab from the descr package), but it shows the information one pair of columns at a time.

It can be done by manually computing the occurrences of 1 for each pair of columns (e.g. nrow(df[df$a==1 & df$b==1,]) gives 2) and then putting the results into a matrix, but I was wondering whether there is a built-in function that simplifies the process.
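
For reference, the manual approach described above might look like the following sketch. The names cols and manual are only illustrative, and the nested loop is just one way to fill the matrix.

```r
# Data frame from the question
df <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c(2, 1, 1, 1, 2),
  c = c(2, 2, 1, 1, 1),
  d = c(1, 2, 2, 1, 1)
)

cols <- names(df)

# Result matrix with the column names on both axes, initially all zeros
manual <- matrix(0, nrow = length(cols), ncol = length(cols),
                 dimnames = list(cols, cols))

for (i in cols) {
  for (j in cols) {
    # Count the rows where both column i and column j equal 1
    manual[i, j] <- sum(df[[i]] == 1 & df[[j]] == 1)
  }
}

manual
```

This works, but it visits every pair of columns explicitly, which is what a built-in function would ideally avoid.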


Solution

  • We can use crossprod on a logical matrix to count the occurrences of the value 1 in the question's example:

    m1 <- as.matrix(df == 1) # see Note[1]
    out <- crossprod(m1)
    

    Note[1] Pointed out by @imo (see comments below) to address the general case (a matrix with values [x,y]). For a matrix with [0,1] values, df == 1 can be replaced by df. To count the value 2 from the question's example instead, use df == 2.

    If the lower triangle should be 0 or NA:

    out[lower.tri(out)] <- NA
    out
    #   a  b  c d
    #a  3  2  1 1
    #b NA  3  2 1
    #c NA NA  3 2
    #d NA NA NA 3
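
To make the note about other values concrete, here is a minimal sketch that wraps the same crossprod call in a small helper and applies it to each value of interest. The helper name count_pairs and the list name counts are hypothetical, not built-in functions. It relies on the fact that crossprod(m) computes t(m) %*% m, so entry [i, j] sums the products of the TRUE/FALSE indicators of columns i and j, which is the number of rows where both are TRUE.

```r
# Data frame from the question
df <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c(2, 1, 1, 1, 2),
  c = c(2, 2, 1, 1, 1),
  d = c(1, 2, 2, 1, 1)
)

# count_pairs() is a hypothetical helper name used for illustration.
# It returns a matrix whose [i, j] entry is the number of rows in which
# both column i and column j equal `value`.
count_pairs <- function(data, value) {
  indicator <- as.matrix(data == value)  # TRUE where the cell equals `value`
  crossprod(indicator)                   # same as t(indicator) %*% indicator
}

# One count matrix per value of interest, stored in a named list
values <- c(1, 2)
counts <- lapply(values, function(v) count_pairs(df, v))
names(counts) <- values

counts[["1"]]  # pair counts for the value 1
counts[["2"]]  # pair counts for the value 2
```

For a different set of values, such as [-1,1,2], only the values vector should need to change.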