
Efficient way to convert CSV to Sparse Matrix in R


I have a very large CSV file of similarities between keywords (about 91 million rows, so a for loop takes too long in R). When I read it into a data.frame, it looks like this:

> df   
kwd1 kwd2 similarity  
a  b  1  
b  a  1  
c  a  2  
a  c  2 

It is a sparse list and I would like to convert it into a sparse matrix:

> myMatrix 
  a b c  
a . 1 2
b 1 . .
c 2 . .

I tried using sparseMatrix(), but converting the keyword names to integer indexes takes too much time.

Thanks for any help!


Solution

  • acast from the reshape2 package will do this nicely. There are base R solutions, but I find their syntax much more difficult. Note that acast returns an ordinary dense matrix, not a sparse one.

    library(reshape2)
    # Example data from the question, with the keywords stored as factors
    df <- data.frame(kwd1 = factor(c("a", "b", "c", "a")),
                     kwd2 = factor(c("b", "a", "a", "c")),
                     similarity = c(1L, 1L, 2L, 2L))
    
    acast(df, kwd1 ~ kwd2, value.var='similarity', fill=0)
    
      a b c
    a 0 1 2
    b 1 0 0
    c 2 0 0
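
    As one possible illustration of the base R route mentioned above, xtabs() accepts sparse=TRUE and then returns a sparse matrix instead of a dense table. This is only a sketch on the toy data from the question; it assumes the Matrix package is installed, and the name sm is made up for the example.

    ```r
    library(Matrix)

    # Toy data from the question (similarity stored as plain numbers)
    df <- data.frame(
      kwd1 = c("a", "b", "c", "a"),
      kwd2 = c("b", "a", "a", "c"),
      similarity = c(1, 1, 2, 2)
    )

    # Cross-tabulate similarity by the two keyword columns.
    # sparse = TRUE asks xtabs() for a sparse matrix rather than a dense table;
    # values for repeated (kwd1, kwd2) pairs would be summed.
    sm <- xtabs(similarity ~ kwd1 + kwd2, data = df, sparse = TRUE)
    ```

    Whether this is fast enough for 91 million rows has not been measured here.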
    

    Using sparseMatrix from the Matrix package:

    library(Matrix)
    # factor() gives every distinct keyword an integer code, so no loop is needed
    df$kwd1 <- factor(df$kwd1)
    df$kwd2 <- factor(df$kwd2)
    
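    # Without dimnames, the rows and columns are identified only by index.
    foo <- sparseMatrix(i=as.integer(df$kwd1), j=as.integer(df$kwd2), x=df$similarity)
    
    > foo
    3 x 3 sparse Matrix of class "dgCMatrix"
              
    [1,] . 1 2
    [2,] 1 . .
    [3,] 2 . .
    
    # Passing dimnames labels the rows and columns with the keywords.
    foo <- sparseMatrix(i=as.integer(df$kwd1), j=as.integer(df$kwd2), x=df$similarity,
                        dimnames=list(levels(df$kwd1), levels(df$kwd2)))
    
    > foo
    3 x 3 sparse Matrix of class "dgCMatrix"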
      a b c
    a . 1 2
    b 1 . .
    c 2 . .
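
One caveat with the factor approach: kwd1 and kwd2 are converted separately, so each column gets its own set of levels. If a keyword appeared in only one of the two columns, the row and column indices would no longer refer to the same keywords. A minimal sketch of one way around this, assuming the keywords are read as character strings, is to build a single shared keyword vector and look up positions with match(). The names keywords, i, j and sim are illustrative only.

```r
library(Matrix)

# Toy data from the question, with keywords kept as character strings
df <- data.frame(
  kwd1 = c("a", "b", "c", "a"),
  kwd2 = c("b", "a", "a", "c"),
  similarity = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One shared, sorted set of keywords used for both rows and columns
keywords <- sort(unique(c(df$kwd1, df$kwd2)))

# match() is vectorised: it returns the position of each keyword in `keywords`
i <- match(df$kwd1, keywords)
j <- match(df$kwd2, keywords)

# Square sparse matrix whose row and column names come from the same vector
sim <- sparseMatrix(
  i = i, j = j, x = df$similarity,
  dims = c(length(keywords), length(keywords)),
  dimnames = list(keywords, keywords)
)
```

With the shared vector, sim["a", "c"] and sim["c", "a"] always refer to the same pair of keywords, whichever column each one happened to appear in.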