I have a very large CSV file of similarities between keywords (about 91 million rows, so a for loop in R takes too long). When I read it into a data.frame, it looks like this:
> df
kwd1 kwd2 similarity
a b 1
b a 1
c a 2
a c 2
It is a sparse list and I would like to convert it into a sparse matrix:
> myMatrix
a b c
a . 1 2
b 1 . .
c 2 . .
I tried using sparseMatrix(), but converting the keyword names to integer indexes takes too much time.
Thanks for any help!
acast from the reshape2 package will do this nicely, although it returns an ordinary dense matrix rather than a sparse one. There are base R solutions, but I find the syntax much more difficult.
library(reshape2)
df <- structure(list(kwd1 = structure(c(1L, 2L, 3L, 1L), .Label = c("a",
"b", "c"), class = "factor"), kwd2 = structure(c(2L, 1L, 1L,
3L), .Label = c("a", "b", "c"), class = "factor"), similarity = c(1L,
1L, 2L, 2L)), .Names = c("kwd1", "kwd2", "similarity"), class = "data.frame", row.names = c(NA,
-4L))
acast(df, kwd1 ~ kwd2, value.var='similarity', fill=0)
a b c
a 0 1 2
b 1 0 0
c 2 0 0
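For comparison, one of the base R routes mentioned above is xtabs(). The following is only a minimal sketch on the toy data, assuming both keyword columns are factors that share one set of levels (kw, dense and sp are illustrative names). With sparse = TRUE, xtabs() returns a sparse matrix, which relies on the Matrix package being installed.

```r
library(Matrix)  # needed for the sparse result and its methods

# Toy data matching the question; shared levels keep rows and columns aligned
kw <- c("a", "b", "c")
df <- data.frame(
  kwd1 = factor(c("a", "b", "c", "a"), levels = kw),
  kwd2 = factor(c("b", "a", "a", "c"), levels = kw),
  similarity = c(1, 1, 2, 2)
)

# Dense table: sums similarity within each (kwd1, kwd2) cell
dense <- xtabs(similarity ~ kwd1 + kwd2, data = df)

# Same cross-tabulation, returned as a sparse matrix
sp <- xtabs(similarity ~ kwd1 + kwd2, data = df, sparse = TRUE)
```

The dense form is convenient for small keyword sets. For data on the scale described in the question, the sparse form is the one more likely to fit in memory.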
Using sparseMatrix from the Matrix package:
library(Matrix)
df$kwd1 <- factor(df$kwd1)
df$kwd2 <- factor(df$kwd2)
foo <- sparseMatrix(as.integer(df$kwd1), as.integer(df$kwd2), x=df$similarity)
> foo
3 x 3 sparse Matrix of class "dgCMatrix"
# Pass dimnames so the rows and columns keep their keyword labels
foo <- sparseMatrix(as.integer(df$kwd1), as.integer(df$kwd2), x=df$similarity, dimnames=list(levels(df$kwd1), levels(df$kwd2)))
> foo
3 x 3 sparse Matrix of class "dgCMatrix"
a b c
a . 1 2
b 1 . .
c 2 . .
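One caveat: calling factor() on kwd1 and kwd2 separately only lines the rows and columns up when both columns contain exactly the same set of keywords. The sketch below is one possible way around that, assuming the keyword columns are plain character vectors. It builds a single shared keyword vector and uses the vectorised match() to get both index columns without a loop. The names kw, i, j and m are illustrative, not part of the original answer.

```r
library(Matrix)

# Toy data in the same shape as the question (character columns assumed)
df <- data.frame(
  kwd1 = c("a", "b", "c", "a"),
  kwd2 = c("b", "a", "a", "c"),
  similarity = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One shared, sorted keyword set so row k and column k name the same keyword
kw <- sort(unique(c(df$kwd1, df$kwd2)))

# match() is vectorised: it maps every keyword to its position in kw
i <- match(df$kwd1, kw)
j <- match(df$kwd2, kw)

# Square sparse matrix labelled with the shared keyword set on both axes
m <- sparseMatrix(i = i, j = j, x = df$similarity,
                  dims = c(length(kw), length(kw)),
                  dimnames = list(kw, kw))
m
```

Note that sparseMatrix() by default adds up the x values of any repeated (i, j) pair, so duplicate rows in the input would be summed rather than overwritten.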