r, matrix, element, sparse-matrix

Creating a 100K*100K (10 billion element) sparse matrix in R


I have a CSV file with two columns. Column 1 holds a username and Column 2 holds the username that user replied to. The file has 2 million records in total, and there are around 100K unique usernames across Column 1 and Column 2. I want to create a 100K*100K matrix giving the number of times each user has communicated with each of the other 99,999 users. Is it possible to create such a matrix in R? The matrix will obviously be very sparse: with only 2 million records out of a possible 10 billion cells, at most 0.02% of the entries can be nonzero, so at least 99.98% of the matrix is zeros. How do I count how many times each user has communicated with every other user and put the result in matrix form?
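As a rough check of the numbers above, the sketch below works out the density and compares the memory a dense matrix of doubles would need with a back-of-the-envelope estimate for compressed sparse column storage. The figure of 12 bytes per nonzero (a 4-byte row index plus an 8-byte double) is an assumption about the storage layout, not something stated in the question.

```r
n_users   <- 1e5      # approximate number of unique usernames
n_records <- 2e6      # rows in the CSV

cells   <- n_users^2              # 1e10 possible (i, j) pairs
density <- n_records / cells      # upper bound on the nonzero fraction: 2e-04, i.e. 0.02%

dense_bytes  <- cells * 8                             # 8 bytes per double
sparse_bytes <- n_records * 12 + (n_users + 1) * 4    # values + row indices + column pointers

dense_gb  <- dense_bytes / 1e9    # 80 GB
sparse_mb <- sparse_bytes / 1e6   # roughly 24.4 MB
```

So a dense matrix is out of reach on most machines, while a sparse representation of the same data should fit comfortably in memory.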


Solution

  • You can use sparseMatrix from the Matrix package:

     library(Matrix)
     # stringi is used only to generate random strings for this demo
     library(stringi)
     set.seed(1)
     # generate 100K usernames
     users <- stri_rand_strings(100000, 6)
     # simulate col1 and col2 (1 million sampled pairs)
     col1 <- sample(users, 1000000, replace = TRUE)
     col2 <- sample(users, 1000000, replace = TRUE)
     # map usernames to integer indices through factor
     col1 <- factor(col1, levels = users)
     col2 <- factor(col2, levels = users)
     # create the matrix; dims keeps it square even if some user never appears
     n <- length(users)
     mySparseMatrix <- sparseMatrix(as.integer(col1), as.integer(col2), x = 1, dims = c(n, n))
     # not a huge object
     object.size(mySparseMatrix)
     #12400720 bytes
    

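    This gives a sparse matrix whose (i, j) entry is nonzero if the i-th user replied to the j-th user and 0 otherwise. Note that sparseMatrix adds up the x values of repeated (i, j) pairs by default, so with x=1 a pair that occurs more than once is stored as its count rather than as 1.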
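    To see how sparseMatrix treats repeated pairs, here is a minimal sketch on a toy reply log. The usernames and the `toy` object are made up for illustration; the behaviour relied on is the documented default that the x values of duplicate (i, j) pairs are added.

    ```r
    library(Matrix)

    users <- c("ann", "bob", "cat")             # hypothetical usernames
    from  <- c("ann", "ann", "bob", "ann")      # who replied
    to    <- c("bob", "bob", "cat", "cat")      # who was replied to

    i <- as.integer(factor(from, levels = users))
    j <- as.integer(factor(to,   levels = users))

    # duplicate (i, j) pairs are summed, so ann -> bob ends up as 2
    toy <- sparseMatrix(i, j, x = 1,
                        dims = c(length(users), length(users)),
                        dimnames = list(users, users))

    toy["ann", "bob"]   # 2
    toy["bob", "ann"]   # 0
    ```

    If a strict 0/1 indicator is wanted instead, one option is to deduplicate the pairs before building the matrix.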

    Edit

    If you also want to record how many times the i-th user communicated with the j-th, you can aggregate the pairs explicitly with the data.table package. Right after creating col1 and col2:

      library(data.table)
      dt <- data.table(col1 = factor(col1, levels = users), col2 = factor(col2, levels = users))
      # count the occurrences of each (col1, col2) pair
      dt <- dt[, list(times = .N), by = list(col1, col2)]
      mySparseMatrix <- sparseMatrix(as.integer(dt$col1), as.integer(dt$col2), x = dt$times,
                                     dims = c(length(users), length(users)))
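    For the real data, the same idea can be sketched end to end as follows. This is only an illustration under assumptions: the column names `user` and `replied_to` are hypothetical, a few inline lines stand in for the actual CSV file, and match() against a shared username vector is used here in place of factor() to obtain the indices.

    ```r
    library(Matrix)

    # stand-in for the real file; with an actual CSV, read.csv() on its path would be used
    # (the column names `user` and `replied_to` are assumptions)
    csv_lines <- c("user,replied_to", "ann,bob", "ann,bob", "bob,cat", "cat,ann")
    replies <- read.csv(text = paste(csv_lines, collapse = "\n"), stringsAsFactors = FALSE)

    # one shared username vector so that row i and column i refer to the same user
    users <- sort(unique(c(replies$user, replies$replied_to)))
    i <- match(replies$user, users)
    j <- match(replies$replied_to, users)

    # repeated (i, j) pairs are added up, giving reply counts
    counts <- sparseMatrix(i, j, x = 1,
                           dims = c(length(users), length(users)),
                           dimnames = list(users, users))

    counts["ann", "bob"]   # number of replies from ann to bob
    rowSums(counts)        # total replies sent by each user
    ```

    Setting dimnames lets you look up counts by username rather than by integer index, at the cost of storing the username vector twice.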