I have a CSV file with two columns: column 1 holds a username and column 2 holds the username that user replied to. The file has 2 million records in total, and there are around 100K unique usernames across the two columns. I want to create a 100K x 100K matrix that gives the number of times each user has communicated with each of the other 99,999 users. Is it possible to create this matrix in R? The matrix will obviously be very sparse: at most 2 million of the 10 billion possible cells can be nonzero, which is only 0.02%, so at least 99.98% of the entries are zeros. How do I count how many times each user has communicated with every other user and store the result as a matrix?
You can use sparseMatrix from the Matrix package:
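library(Matrix)
# stringi is used only to generate random strings for this example
library(stringi)
set.seed(1)
# generate 100k usernames
users <- stri_rand_strings(100000, 6)
# simulate col1 (who replies) and col2 (who is replied to)
col1 <- sample(users, 1000000, replace = TRUE)
col2 <- sample(users, 1000000, replace = TRUE)
# map usernames to integer indices through factor levels
col1 <- factor(col1, levels = users)
col2 <- factor(col2, levels = users)
# create the matrix; dims is set explicitly so that users who never
# appear in one of the columns still get a row and a column
mySparseMatrix <- sparseMatrix(as.integer(col1), as.integer(col2), x = 1,
                               dims = c(length(users), length(users)))
# not a huge object
object.size(mySparseMatrix)
#12400720 bytes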
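In this way you create a sparseMatrix whose i,j value is nonzero if the i-th user replied to the j-th user and 0 otherwise. By default, sparseMatrix adds up the x values of repeated i,j pairs, so with x=1 each entry is already the number of records for that pair.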
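The way sparseMatrix treats repeated i,j pairs is easy to check on a toy input. The sketch below uses made-up indices (not the simulated data above) and relies on the default use.last.ij = FALSE, under which the x values of duplicated pairs are added together:

```r
library(Matrix)

# Toy data: user 1 replies to user 2 three times, user 2 replies to user 3 once
i <- c(1, 1, 1, 2)
j <- c(2, 2, 2, 3)

# One unit of weight per record; duplicated (i, j) pairs are summed by default
m <- sparseMatrix(i = i, j = j, x = rep(1, length(i)), dims = c(3, 3))

m[1, 2]  # 3: the three (1, 2) records were added together
m[2, 3]  # 1
m[3, 1]  # 0: no record for this pair
```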
Edit
To make the count of how many times the i-th user communicated with the j-th explicit, you can aggregate the records first with the data.table package. Just after creating col1 and col2:
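library(data.table)
dt <- data.table(col1 = factor(col1, levels = users), col2 = factor(col2, levels = users))
# count the records for each (col1, col2) pair
dt <- dt[, list(times = .N), by = list(col1, col2)]
# one entry per pair, holding the number of times that pair occurs
mySparseMatrix <- sparseMatrix(as.integer(dt$col1), as.integer(dt$col2), x = dt$times,
                               dims = c(length(users), length(users)))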
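Once the matrix exists, it may be convenient to look counts up by username rather than by index. The following is a minimal sketch on a tiny hypothetical data frame (the names replies, from and to are placeholders standing in for the two CSV columns); it attaches the usernames as dimnames so that entries can be read by name:

```r
library(Matrix)

# Hypothetical stand-in for the two CSV columns
replies <- data.frame(
  from = c("alice", "alice", "bob",   "carol", "alice"),
  to   = c("bob",   "bob",   "carol", "alice", "carol"),
  stringsAsFactors = FALSE
)

# Every username that appears in either column, in a fixed order
users <- sort(unique(c(replies$from, replies$to)))

# Integer row and column indices for each record
i <- match(replies$from, users)
j <- match(replies$to, users)

# Duplicated (i, j) pairs are summed, so each entry is a reply count
m <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                  dims = c(length(users), length(users)),
                  dimnames = list(users, users))

m["alice", "bob"]  # 2: alice replied to bob twice
m["bob", "alice"]  # 0: bob never replied to alice

# Nonzero entries as (i, j, x) triplets
summary(m)
```

If direction does not matter, m + t(m) gives the number of exchanges between two users in either direction.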