r twitter sparse-matrix sentiment-analysis

Create sparse matrix from tweets


I have some tweets and other variables that I would like to convert into a sparse matrix.

My data is currently stored in a data.table with one column containing the tweet and one column containing the score. It looks basically like this:

Tweet               Score
Sample Tweet :)        1
Different Tweet        0

I would like to convert this into a matrix that looks like this:

Score Sample Tweet Different :)
    1      1     1         0  1
    0      0     1         1  0

There should be one row in the sparse matrix for each row in my data.table. Is there an easy way to do this in R?


Solution

  • This is close to what you want

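    library(Matrix)

    # dt is the data.table from the question (columns Tweet and Score).
    # Vocabulary: every distinct space-separated token across all tweets.
    words = unique(unlist(strsplit(dt[, Tweet], ' ')))
    
    # Empty sparse matrix: one row per tweet, one column per word.
    M = Matrix(0, nrow = NROW(dt), ncol = length(words))
    colnames(M) = words
    
    # Flag every tweet in which the word matches between word boundaries.
    for(j in seq_along(words)){
      M[, j] = grepl(paste0('\\b', words[j], '\\b'), dt[, Tweet])
    }
    
    # Append the remaining columns (here only Score).
    M = cbind(M, as.matrix(dt[, setdiff(names(dt), 'Tweet'), with = FALSE]))
    
    #2 x 5 sparse Matrix of class "dgCMatrix"
    #     Sample Tweet :) Different Score
    #[1,]      1     1  .         .     1
    #[2,]      .     1  .         1     .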
    

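    The only small issue is that the regex does not recognise ':)' as a word: '\\b' only matches at a boundary between a word character and a non-word character, and ':)' contains no word characters. Maybe someone who knows regex better can advise how to fix this.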
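
    One possible way around the ':)' problem is to avoid regex altogether: split each tweet into tokens once and build the matrix directly from (row, column) index pairs with sparseMatrix(). The following is only a sketch under some assumptions: tokens are separated by single spaces, and a plain data.frame named tweets (a hypothetical stand-in for the data.table in the question) holds the toy data.

```r
library(Matrix)

# Toy data mirroring the question. A plain data.frame is used so the sketch
# does not depend on data.table (an assumption, not the asker's actual setup).
tweets <- data.frame(
  Tweet = c("Sample Tweet :)", "Different Tweet"),
  Score = c(1, 0),
  stringsAsFactors = FALSE
)

# Split on literal single spaces; fixed = TRUE means no regex is involved,
# so tokens such as ":)" are treated like any other word.
tokens <- strsplit(tweets$Tweet, " ", fixed = TRUE)

# Keep each token once per tweet so the entries are 0/1 indicators.
tokens <- lapply(tokens, unique)
words  <- unique(unlist(tokens))

# Row index: the tweet a token came from. Column index: its position in words.
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), words)

M <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(nrow(tweets), length(words)),
                  dimnames = list(NULL, words))

# Append the remaining columns (here only Score) as a sparse block.
M <- cbind(M, Matrix(as.matrix(tweets["Score"]), sparse = TRUE))
```

    With this approach the ':)' column is filled in for the first tweet, and the per-word loop with repeated assignment into the sparse matrix is no longer needed. Real tweets would likely need more careful tokenisation (punctuation, repeated whitespace, case), which this sketch does not attempt.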