r matrix unique combinations cross-product

Generate matrix of unique user-item cross-product combinations


I am trying to create a cross-product matrix of unique users in R. I searched SO but could not find what I was looking for. I have a large data frame (over a million rows); a sample is shown below:

df <- data.frame(Products=c('Product a', 'Product b', 'Product a', 
                            'Product c', 'Product b', 'Product c'),
                 Users=c('user1', 'user1', 'user2', 'user1', 
                         'user2','user3'))

Output of df is:

   Products Users
1 Product a user1
2 Product b user1
3 Product a user2
4 Product c user1
5 Product b user2
6 Product c user3

I would like to see two matrices. The first shows, for each pair of products, the number of unique users who had either product (OR), so the output will be something like:

            Product a   Product b   Product c
Product a                 2            3
Product b     2                        3
Product c     3           3 

The second matrix shows the number of unique users who had both products (AND):

            Product a   Product b   Product c
Product a                 2            1
Product b     2                        1
Product c     1           1 

Any help is appreciated.

Thanks

UPDATE:

To clarify: Product a is used by user1 and user2, Product b by user1 and user2, and Product c by user1 and user3. So in the first matrix, the Product a / Product b entry is 2, since those two products have 2 unique users between them. Similarly, the Product a / Product c entry is 3. In the second matrix, the same entries are 2 and 1, since there I want the intersection. Thanks
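The expected values can be checked by hand with base R's set functions. This is only a minimal sketch for one pair of products on the sample data; the names `users_a`, `users_c`, `n_or` and `n_and` are made up for illustration.

```r
# Sample data from the question
df <- data.frame(Products = c('Product a', 'Product b', 'Product a',
                              'Product c', 'Product b', 'Product c'),
                 Users = c('user1', 'user1', 'user2', 'user1',
                           'user2', 'user3'),
                 stringsAsFactors = FALSE)

# Unique users for each of the two products being compared
users_a <- unique(df$Users[df$Products == 'Product a'])  # user1, user2
users_c <- unique(df$Users[df$Products == 'Product c'])  # user1, user3

# OR: users who had either product; AND: users who had both
n_or  <- length(union(users_a, users_c))      # 3
n_and <- length(intersect(users_a, users_c))  # 1
```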


Solution

  • Try

    lst <- split(df$Users, df$Products)
    ln <- length(lst)
    m1 <-  matrix(0, ln,ln, dimnames=list(names(lst), names(lst)))
    m1[lower.tri(m1, diag=FALSE)] <- combn(seq_along(lst), 2, 
                   FUN= function(x) length(unique(unlist(lst[x]))))
    # Mirror via the transpose; copying m1[lower.tri(m1)] directly misorders entries beyond 3 x 3
    m1[upper.tri(m1)] <- t(m1)[upper.tri(m1)]
    m1
    #          Product a Product b Product c
    #Product a         0         2         3
    #Product b         2         0         3
    #Product c         3         3         0
    

    Or using outer

    f1 <- function(u, v) length(unique(unlist(c(lst[[u]], lst[[v]]))))
    # Multiply by !diag(n) to zero the diagonal for any number of products
    res <- outer(seq_along(lst), seq_along(lst), FUN = Vectorize(f1)) * !diag(length(lst))
    dimnames(res) <- rep(list(names(lst)),2)
    res
    #          Product a Product b Product c
    #Product a         0         2         3
    #Product b         2         0         3
    #Product c         3         3         0
    

    For the second case

    # unique() makes a repeated user-product row count only once
    tab <- table(unique(df))
    tcrossprod(tab) * !diag(nrow(tab))
    #            Products
    #Products    Product a Product b Product c
    # Product a         0         2         1
    # Product b         2         0         1
    # Product c         1         1         0
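
Both matrices can also be derived from a single `tcrossprod` call, since the OR count follows from the AND count by inclusion-exclusion: users of A or B = users of A + users of B - users of both. The following is a sketch under that assumption, not a tuned solution for a million rows; the names `tab`, `and_full`, `n_users`, `or_mat` and `and_mat` are made up for illustration.

```r
# Sample data from the question
df <- data.frame(Products = c('Product a', 'Product b', 'Product a',
                              'Product c', 'Product b', 'Product c'),
                 Users = c('user1', 'user1', 'user2', 'user1',
                           'user2', 'user3'),
                 stringsAsFactors = FALSE)

# Product-by-user incidence table; unique() drops repeated rows
tab <- table(unique(df))

# Off-diagonal: users shared by each product pair (AND)
# Diagonal: number of unique users of each product
and_full <- tcrossprod(tab)
n_users  <- diag(and_full)

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
or_mat <- outer(n_users, n_users, "+") - and_full

# Zero the diagonals to match the layout of the matrices above
and_mat <- and_full
diag(and_mat) <- 0
diag(or_mat)  <- 0

or_mat
and_mat
```

This avoids calling `unique` once per product pair, but the dense table still needs one cell per product-user combination, so it may be too large when there are very many distinct users.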