Assigning different numeric weights to different terms in quanteda dfm does not work


I am new to text analytics and am currently trying out the quanteda package in R. I want to assign different numeric weights to some specific terms and then test the model accuracy. I tried the approach from another thread here, "Assigning weights to different features in R", which preserves the dfm class, but I could not get the correct output. Any help would be appreciated.

Here is what I tried

##install.packages("quanteda")
require(quanteda)
str <- c("apple is better than banana",
         "banana banana apple much better",
         "much much better new banana")

weights <- c(apple = 5, banana = 3, much = 0.5)
myDfm <- dfm(str, remove = stopwords("english"), verbose = FALSE)

#output
##Document-feature matrix of: 3 documents, 5 features.
##3 x 5 sparse Matrix of class "dfmSparse"
##   features
##docs    apple better banana much new
##text1     1      1      1    0   0
##text2     1      1      2    1   0
##text3     0      1      1    2   1

newweights <- weights[featnames(myDfm)]
# reassign 1 to non-matched NAs
newweights[is.na(newweights)] <- 1

# this does not work for me - see the output
myDfm * newweights

##output
##Document-feature matrix of: 3 documents, 5 features.
##3 x 5 sparse Matrix of class "dfmSparse"
##   features
##docs    apple better banana much new
##text1     5    0.5    1.0    0   0
##text2     1    1.0    6.0    5   0
##text3     0    5.0    0.5    2   1

Environment Details

platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety


Solution

  • This apparently has something to do with the * operator in the Matrix package on which the dfm class is based. This works:

    > matrix(1:6, nrow = 3) * c(2, 3)
         [,1] [,2]
    [1,]    2   12
    [2,]    6   10
    [3,]    6   18
    

    but this does not:

    > Matrix::Matrix(matrix(1:6, nrow = 3)) * c(2, 3)
    Error in Matrix(matrix(1:6, nrow = 3)) * c(2, 3) : 
      length of 2nd arg does not match dimension of first
    

    Until we get this fixed, here is a workaround: make the weight vector correspond element-by-element to the dfm.

    myDfm * rep(newweights, each = ndoc(myDfm))
    ## Document-feature matrix of: 3 documents, 5 features.
    ## 3 x 5 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    apple better banana much new
    ##   text1     5      1      3  0     0
    ##   text2     5      1      6  0.5   0
    ##   text3     0      1      3  1.0   1
    

    Updated:

    This is not a bug but a feature: it has to do with how the vector newweights is recycled to conform to the matrix it is being multiplied with. R recycles the vector in column-major order, so it effectively builds the following matrix and multiplies it element by element with the dfm. That runs without error in this example, although not as you want it to:

    matrix(rep(newweights, 3), nrow = 3)
    ##      [,1] [,2] [,3] [,4] [,5]
    ## [1,]    5  0.5  1.0    1  3.0
    ## [2,]    1  1.0  3.0    5  0.5
    ## [3,]    3  5.0  0.5    1  1.0
    

    If you want to use your original strategy, this will work:

    t(t(myDfm) * newweights)
    ## Document-feature matrix of: 3 documents, 5 features (26.7% sparse).
    ## 3 x 5 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    apple better banana much new
    ##   text1     5      1      3  0     0
    ##   text2     5      1      6  0.5   0
    ##   text3     0      1      3  1.0   1
    

    because the recycling now occurs over features rather than over documents.
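
The recycling behaviour can also be checked without quanteda. The following is a minimal base-R sketch, not the quanteda method itself: `counts` is a plain matrix typed in by hand to mirror the dfm from the question, and `w` stands in for `newweights` (both names are just for illustration). It also shows `sweep()` as a third option alongside the two workarounds above; whether `sweep()` preserves the dfm class on a real dfm is not something this sketch tests.

```r
# Plain matrix mirroring the dfm in the question (filled column by column).
counts <- matrix(c(1, 1, 0,   # apple
                   1, 1, 1,   # better
                   1, 2, 1,   # banana
                   0, 1, 2,   # much
                   0, 0, 1),  # new
                 nrow = 3,
                 dimnames = list(docs = paste0("text", 1:3),
                                 features = c("apple", "better", "banana",
                                              "much", "new")))

# Stand-in for newweights: one weight per feature, 1 where none was given.
w <- c(apple = 5, better = 1, banana = 3, much = 0.5, new = 1)

# Plain `*` recycles w down the columns (column-major order),
# so the weights do not line up with the features.
wrong <- counts * w

# Three ways to scale each column by its own weight.
by_rep   <- counts * rep(w, each = nrow(counts))  # expand w to match cell order
by_t     <- t(t(counts) * w)                      # recycle over features instead
by_sweep <- sweep(counts, 2, w, `*`)              # apply w along margin 2 (columns)

print(wrong)
print(by_sweep)
```

Here `wrong` reproduces the unexpected output from the question, while `by_rep`, `by_t`, and `by_sweep` all agree with the weighted table shown above.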