Search code examples
rsparse-matrix

Find co-occurrence of values in large data set


I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.

  Month  Customer ID Store
  Jan       1      A
  Jan       4      A
  Jan       2      A
  Jan       3      A    
  Feb       7      B
  Feb       2      B
  Feb       1      B
  Feb       12     B
  Mar       1      C
  Mar       11     C 
  Mar       3      C
  Mar       12     C

I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:

      A        B         C
A     4        2         2     
B     2        4         2
C     2        2         4

For example, since customer visited Store A and then Store B in the next month, they would be added to the tally. I'm interested in number of shared customers, not number of visits.

I tried the sparse matrix approach in this thread(Creating co-occurrence matrix), but the numbers returned don't match up for some reason I cannot understand.

Any ideas would be greatly appreciated!


Solution

  • Update: The original solution that I posted worked for your data. But your data has the unusual property that no customer ever visited the same store in two different months. Presuming that would happen, a modification is needed.

    What we need is a matrix of stores by customers that has 1 if the customer ever visited the store and zero otherwise. The original solution used
    M = as.matrix(table(Dat$ID_Store, Dat$Customer))
    which gives how many different months the store was visited by each customer. With different data, these numbers might be more than one. We can fix that by using
    M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
    If you look at this matrix, it will say TRUE and FALSE, but since TRUE=1 and FALSE=0 that will work just fine. So the full corrected solution is:

    M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
    M %*% t(M)
    
        A B C
      A 4 2 2
      B 2 4 2
      C 2 2 4