I'm trying to calculate how many people play different combinations of sports. I know how to calculate and plot the correlation, but I'd like to get the actual numbers of how many people fall into each combined group.
Here's a simplified demo of my data. A 1
means a given person plays the given sport, and a 0
means she does not.
sports_example <- tibble(
name = c(
"Scarlett",
"Heather",
"Sarah",
"Anna",
"Emma",
"Charlotte",
"Cheryl"
),
hockey = c(1L, 1L, 1L, 1L, 0L, 1L, 1L),
basketball = c(0L, 1L, 1L, 0L, 1L, 1L, 0L),
track = c(1L, 1L, 1L, 0L, 1L, 0L, 1L),
football = c(0L, 1L, 0L, 0L, 0L, 0L, 0L)
)
Using the code below, I can calculate the correlations between different sports. I can tell that hockey shares more players with basketball than football. But I'd like to calculate how many athletes play both hockey and basketball (3 in this case). Is there an easy algorithm for counting the overlap? My real dataset has dozens of columns, so my one-off code isn't cutting it (e.g. nrow(filter(sports_example, hockey + basketball == 2))
).
sports_example %>%
select(-name) %>%
cor() %>%
corrgram::corrgram(upper.panel = NULL)
For pairwise results, you can calculate the matrix cross product of the binary variables, then set the upper triangle and diagonal to zero so we can filter to keep only the unique pairs.
res <- crossprod(as.matrix(sports_example[-1]))
res[upper.tri(res, diag = TRUE)] <- 0
subset(as.data.frame.table(res), Freq > 0)
Var1 Var2 Freq
2 basketball hockey 3
3 track hockey 4
4 football hockey 1
7 track basketball 3
8 football basketball 1
12 football track 1