I have some tweets and other variables that I would like to convert into a sparse matrix.
This is roughly what my data looks like. It is currently stored in a data.table with one column containing the tweet and one column containing the score.
Tweet            Score
Sample Tweet :)  1
Different Tweet  0
I would like to convert this into a matrix that looks like this:
Score  Sample  Tweet  Different  :)
1      1       1      0          1
0      0       1      1          0
There should be one row in the sparse matrix for each row in my data.table. Is there an easy way to do this in R?
This is close to what you want:
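library(Matrix)
# Vocabulary: every distinct space-separated token across all tweets
words = unique(unlist(strsplit(dt[, Tweet], ' ')))
# Empty sparse matrix: one row per tweet, one column per word
M = Matrix(0, nrow = NROW(dt), ncol = length(words))
colnames(M) = words
# Flag each tweet in which the word appears between word boundaries
for(j in seq_along(words)){
  M[, j] = grepl(paste0('\\b', words[j], '\\b'), dt[, Tweet])
}
# Append the remaining columns (here only Score)
M = cbind(M, as.matrix(dt[, setdiff(names(dt), 'Tweet'), with = FALSE]))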
# 2 x 5 sparse Matrix of class "dgCMatrix"
#      Sample Tweet :) Different Score
# [1,]      1     1  .         .     1
# [2,]      .     1  .         1     .
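The only small issue is that the regex does not recognise ':)' as a word. \b only matches at a boundary between a word character and a non-word character, and ':)' contains no word characters, so the pattern never matches it. Maybe someone who knows regex better can advise how to fix this.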
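One possible way around this, offered as a sketch rather than a definitive fix, is to skip the regex altogether: split each tweet on spaces once, then pass the (row, column) positions of the tokens straight to sparseMatrix(). Tokens are compared literally, so ':)' needs no special handling. The sample dt below is an assumption reconstructed from the question, and the code assumes tokens are separated by single spaces.

```r
library(Matrix)
library(data.table)

# Assumed sample data, reconstructed from the question
dt <- data.table(Tweet = c("Sample Tweet :)", "Different Tweet"),
                 Score = c(1, 0))

# Split on literal single spaces; no regex is involved
tokens <- strsplit(dt$Tweet, " ", fixed = TRUE)
words  <- unique(unlist(tokens))

# Row index: tweet number, repeated once per token in that tweet
# Column index: position of each token in the vocabulary
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), words)

# Drop duplicate (row, column) pairs so a word repeated
# within one tweet still produces a 1 rather than a count
ij <- unique(cbind(i, j))

M <- sparseMatrix(i = ij[, 1], j = ij[, 2], x = rep(1, nrow(ij)),
                  dims = c(nrow(dt), length(words)),
                  dimnames = list(NULL, words))

# Append the remaining columns (here only Score) as a sparse block
rest <- as.matrix(dt[, setdiff(names(dt), "Tweet"), with = FALSE])
M <- cbind(M, Matrix(rest, sparse = TRUE))
```

Because the matrix is built from index vectors in one call, this also avoids the column-by-column assignment loop, which may matter when the vocabulary is large. If you want word counts instead of 0/1 flags, dropping the unique() step should let sparseMatrix() sum the duplicated positions.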