I have a list of events and guests who have attended these events. Like so but a much bigger file:
event guests
birthday John Doe
birthday Jane Doe
birthday Mark White
wedding John Doe
wedding Jane Doe
wedding Matthew Green
bar mitzvah Janet Black
bar mitzvah John Doe
bar mitzvah Jane Doe
bar mitzvah William Hill
retirement Janet Black
retirement Matthew Green
I want to find the most common combination of two guests who attend the most events together. So in this example, the answer should be John Doe
and Jane Doe
attend the most events together because they have both attended three of the same events. The output should be a list of these pairs.
Where do I even start?
A slightly different approach from a social networks/matrix algebra point of view:
Your data describes links between individuals by shared membership. This is an affiliation matrix and we can compute the matrix of connections between individuals $i$ and $j$ as follows:
# Load as a data frame
df <- data.frame(event = c(rep("birthday", 3),
rep("wedding", 3),
rep("bar mitzvah", 4),
rep("retirement", 2)),
guests = c("John Doe", "Jane Doe", "Mark White",
"John Doe", "Jane Doe", "Matthew Green",
"Janet Black", "John Doe", "Jane Doe",
"William Hill", "Janet Black", "Matthew Green"))
# You can represent who attended which event as a matrix
M <- table(df$guests, df$event)
# Now we can compute how many times each individual appeared at an
# event with another with a simple matrix product
admat <- M %*% t(M)
admat
##################Jane Doe Janet Black John Doe Mark White Matthew Green William Hill
#Jane Doe 3 1 3 1 1 1
#Janet Black 1 2 1 0 1 1
#John Doe 3 1 3 1 1 1
#Mark White 1 0 1 1 0 0
#Matthew Green 1 1 1 0 2 0
#William Hill 1 1 1 0 0 1
Now we want to get rid of the diagonal of the matrix (which tells us how many events each individual attended) and one of the two triangles of the matrix which contains redundant information.
diag(admat) <- 0
admat[upper.tri(admat)] <- 0
Now we just want to convert to a format you might prefer. I'll use the melt
function in the reshape2 library.
library(reshape2)
dfmatches <- unique(melt(admat))
# Drop all the zero matches
dfmatches <- dfmatches[dfmatches$value !=0,]
# order it descending
dfmatches <- dfmatches[order(-dfmatches$value),]
dfmatches
# Var1 Var2 value
#3 John Doe Jane Doe 3
#2 Janet Black Jane Doe 1
#4 Mark White Jane Doe 1
#5 Matthew Green Jane Doe 1
#6 William Hill Jane Doe 1
#9 John Doe Janet Black 1
#11 Matthew Green Janet Black 1
#12 William Hill Janet Black 1
#16 Mark White John Doe 1
#17 Matthew Green John Doe 1
#18 William Hill John Doe 1
Obviously you could tidy the output up by renaming the variables of interest etc.
This general approach -- by which I mean recognizing that your data describes a social network -- might be of interest to you for further analysis (for instance, maybe people are meaningfully linked if they go to parties with a lot of the same people, even if not with each other). If your data set is really big you can make the matrix algebra a little faster by using sparse matrices, or by loading the igraph package and working with the functions there for declaring social networks.