Search code examples
rlistsimilarityrecommendation-engine

Find most common combination between pairs


I have a list of events and guests who have attended these events. Like so but a much bigger file:

event       guests
birthday    John Doe
birthday    Jane Doe
birthday    Mark White
wedding     John Doe
wedding     Jane Doe
wedding     Matthew Green
bar mitzvah Janet Black
bar mitzvah John Doe
bar mitzvah Jane Doe
bar mitzvah William Hill
retirement  Janet Black
retirement  Matthew Green

I want to find the most common combination of two guests who attend the most events together. So in this example, the answer should be John Doe and Jane Doe attend the most events together because they have both attended three of the same events. The output should be a list of these pairs.

Where do I even start?


Solution

  • A slightly different approach from a social networks/matrix algebra point of view:

    Your data describes links between individuals by shared membership. This is an affiliation matrix and we can compute the matrix of connections between individuals $i$ and $j$ as follows:

    # Load as a data frame
    df <- data.frame(event = c(rep("birthday", 3), 
                               rep("wedding", 3), 
                               rep("bar mitzvah", 4), 
                               rep("retirement", 2)), 
                      guests = c("John Doe", "Jane Doe", "Mark White", 
                                 "John Doe", "Jane Doe", "Matthew Green",   
                                  "Janet Black", "John Doe", "Jane Doe",
                                  "William Hill", "Janet Black", "Matthew Green"))
    
    # You can represent who attended which event as a matrix
    M <- table(df$guests, df$event)
    # Now we can compute how many times each individual appeared at an
    # event with another with a simple matrix product
    admat <- M %*% t(M)
    admat
    
    
      ##################Jane Doe Janet Black John Doe Mark White Matthew Green William Hill
      #Jane Doe             3           1        3          1             1            1
      #Janet Black          1           2        1          0             1            1
      #John Doe             3           1        3          1             1            1
      #Mark White           1           0        1          1             0            0
      #Matthew Green        1           1        1          0             2            0
      #William Hill         1           1        1          0             0            1
    

    Now we want to get rid of the diagonal of the matrix (which tells us how many events each individual attended) and one of the two triangles of the matrix which contains redundant information.

    diag(admat) <- 0
    admat[upper.tri(admat)] <- 0
    

    Now we just want to convert to a format you might prefer. I'll use the melt function in the reshape2 library.

    library(reshape2)
    dfmatches <- unique(melt(admat))
    # Drop all the zero matches
    dfmatches <- dfmatches[dfmatches$value !=0,]
    # order it descending
    dfmatches <- dfmatches[order(-dfmatches$value),]
    dfmatches
    
    #            Var1        Var2 value
    #3       John Doe    Jane Doe     3
    #2    Janet Black    Jane Doe     1
    #4     Mark White    Jane Doe     1
    #5  Matthew Green    Jane Doe     1
    #6   William Hill    Jane Doe     1
    #9       John Doe Janet Black     1
    #11 Matthew Green Janet Black     1
    #12  William Hill Janet Black     1
    #16    Mark White    John Doe     1
    #17 Matthew Green    John Doe     1
    #18  William Hill    John Doe     1
    

    Obviously you could tidy the output up by renaming the variables of interest etc.

    This general approach -- by which I mean recognizing that your data describes a social network -- might be of interest to you for further analysis (for instance, maybe people are meaningfully linked if they go to parties with a lot of the same people, even if not with each other). If your data set is really big you can make the matrix algebra a little faster by using sparse matrices, or by loading the igraph package and working with the functions there for declaring social networks.