Search code examples
rcorrelation

Calculate overlap between binary variables


I'm trying to calculate how many people play different combinations of sports. I know how to calculate and plot the correlation, but I'd like to get the actual numbers of how many people fall into each combined group.

Here's a simplified demo of my data. A 1 means a given person plays the given sport, and a 0 means she does not.

    sports_example <- tibble(
      name = c(
        "Scarlett",
        "Heather",
        "Sarah",
        "Anna",
        "Emma",
        "Charlotte",
        "Cheryl"
      ),
      hockey = c(1L, 1L, 1L, 1L, 0L, 1L, 1L),
      basketball = c(0L, 1L, 1L, 0L, 1L, 1L, 0L),
      track = c(1L, 1L, 1L, 0L, 1L, 0L, 1L),
      football = c(0L, 1L, 0L, 0L, 0L, 0L, 0L)
    )

Using the code below, I can calculate the correlations between different sports. I can tell that hockey shares more players with basketball than football. But I'd like to calculate how many athletes play both hockey and basketball (3 in this case). Is there an easy algorithm for counting the overlap? My real dataset has dozens of columns, so my one-off code isn't cutting it (e.g. nrow(filter(sports_example, hockey + basketball == 2))).

sports_example %>% 
  select(-name) %>% 
  cor() %>% 
  corrgram::corrgram(upper.panel = NULL)

Correlation plot of Andrew's demo data


Solution

  • For pairwise results, you can calculate the matrix cross product of the binary variables, then set the upper triangle and diagonal to zero so we can filter to keep only the unique pairs.

    res <- crossprod(as.matrix(sports_example[-1]))
    res[upper.tri(res, diag = TRUE)] <- 0
    subset(as.data.frame.table(res), Freq > 0)
    
             Var1       Var2 Freq
    2  basketball     hockey    3
    3       track     hockey    4
    4    football     hockey    1
    7       track basketball    3
    8    football basketball    1
    12   football      track    1