I am new to text analytics and am currently trying out the quanteda package in R. I want to assign different numeric weights to some specific features and test the model accuracy. I tried the approach from another thread here, "Assigning weights to different features in R", which preserves the dfm class, but I could not get the correct output. Any help would be appreciated.
Here is what I tried:
##install.packages("quanteda")
require(quanteda)
str <- c("apple is better than banana", "banana banana apple much better",
         "much much better new banana")
weights <- c(apple = 5, banana = 3, much = 0.5)
myDfm <- dfm(str, remove = stopwords("english"), verbose = FALSE)
#output
##Document-feature matrix of: 3 documents, 5 features.
##3 x 5 sparse Matrix of class "dfmSparse"
## features
##docs apple better banana much new
##text1 1 1 1 0 0
##text2 1 1 2 1 0
##text3 0 1 1 2 1
newweights <- weights[featnames(myDfm)]
# reassign 1 to non-matched NAs
newweights[is.na(newweights)] <- 1
# this does not work for me - see the output
myDfm * newweights
##output
##Document-feature matrix of: 3 documents, 5 features.
##3 x 5 sparse Matrix of class "dfmSparse"
## features
##docs apple better banana much new
##text1 5 0.5 1.0 0 0
##text2 1 1.0 6.0 5 0
##text3 0 5.0 0.5 2 1
Environment Details
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 2.2
year 2015
month 08
day 14
svn rev 69053
language R
version.string R version 3.2.2 (2015-08-14)
nickname Fire Safety
This apparently has something to do with the * operator in the Matrix package, on which the dfm class is based. This works:
> matrix(1:6, nrow = 3) * c(2, 3)
[,1] [,2]
[1,] 2 12
[2,] 6 10
[3,] 6 18
but this does not:
> Matrix::Matrix(matrix(1:6, nrow = 3)) * c(2, 3)
Error in Matrix(matrix(1:6, nrow = 3)) * c(2, 3) :
length of 2nd arg does not match dimension of first
Until we get this fixed, here is a workaround: make the weight vector correspond element-by-element to the dfm.
myDfm * rep(newweights, each = ndoc(myDfm))
## Document-feature matrix of: 3 documents, 5 features.
## 3 x 5 sparse Matrix of class "dfmSparse"
## features
## docs apple better banana much new
## text1 5 1 3 0 0
## text2 5 1 6 0.5 0
## text3 0 1 3 1.0 1
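To see why `rep(..., each = ndoc(myDfm))` lines the weights up, here is a minimal base R sketch. It assumes a plain matrix `counts` standing in for the dfm (same counts as in the question) and a weight vector `w` equal to `newweights`; both names are made up for illustration, and nothing here touches the dfm class itself.

```r
# Plain-matrix stand-in for the dfm in the question (hypothetical name `counts`).
counts <- matrix(c(1, 1, 0,   # apple
                   1, 1, 1,   # better
                   1, 2, 1,   # banana
                   0, 1, 2,   # much
                   0, 0, 1),  # new
                 nrow = 3,
                 dimnames = list(docs = c("text1", "text2", "text3"),
                                 features = c("apple", "better", "banana",
                                              "much", "new")))

# Same values as `newweights` after the NAs were replaced by 1.
w <- c(apple = 5, better = 1, banana = 3, much = 0.5, new = 1)

# Repeat each weight once per document, so the expanded vector follows the
# column-major layout of the matrix: all of column 1, then all of column 2, ...
expanded <- rep(w, each = nrow(counts))

weighted <- counts * expanded
weighted
```

Each column of `weighted` is the corresponding column of `counts` scaled by one weight, which is the result shown in the output above.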
Updated:
This is not a bug but a feature, and has to do with how the vector newweights is recycled to conform to the matrix that it is being multiplied with. R recycles this vector in column-major order, so it effectively creates the following matrix for the element-by-element multiplication, which runs without error in this example (although not as you want it to):
matrix(rep(newweights, 3), nrow = 3)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 0.5 1.0 1 3.0
## [2,] 1 1.0 3.0 5 0.5
## [3,] 3 5.0 0.5 1 1.0
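As a rough illustration of the difference, the sketch below builds the matrix that recycling produces next to the one that was presumably intended, with one weight per column repeated in every row. The names `w`, `n_docs`, `recycled` and `intended` are invented for this example.

```r
# Same values as `newweights` in the question.
w <- c(apple = 5, better = 1, banana = 3, much = 0.5, new = 1)
n_docs <- 3

# What recycling effectively multiplies by: the weights fill down the columns.
recycled <- matrix(rep(w, n_docs), nrow = n_docs)

# What was intended: every row is the full weight vector, one weight per column.
intended <- matrix(w, nrow = n_docs, ncol = length(w), byrow = TRUE)

recycled[1, ]  # values: 5, 0.5, 1, 1, 3
intended[1, ]  # values: 5, 1, 3, 0.5, 1
```

Only the first element of each row happens to agree, which is why the `apple` count for text1 came out right while most other cells did not.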
If you want to use your original strategy, this will work:
t(t(myDfm) * newweights)
## Document-feature matrix of: 3 documents, 5 features (26.7% sparse).
## 3 x 5 sparse Matrix of class "dfmSparse"
## features
## docs apple better banana much new
## text1 5 1 3 0 0
## text2 5 1 6 0.5 0
## text3 0 1 3 1.0 1
because the recycling occurs now over features and not over documents.
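For plain base R matrices, the double transpose can be checked against `sweep()`, which applies a vector along a chosen margin. This is only a sketch on a stand-in matrix (`counts` and `w` are hypothetical names); whether `sweep()` preserves the dfm class is not something tested here, so treat it as a sanity check of the arithmetic rather than a drop-in replacement.

```r
# Plain-matrix stand-in for the dfm in the question.
counts <- matrix(c(1, 1, 0, 1, 1, 1, 1, 2, 1, 0, 1, 2, 0, 0, 1),
                 nrow = 3,
                 dimnames = list(c("text1", "text2", "text3"),
                                 c("apple", "better", "banana", "much", "new")))
w <- c(apple = 5, better = 1, banana = 3, much = 0.5, new = 1)

# Transpose so that features run down the rows, multiply, transpose back.
via_transpose <- t(t(counts) * w)

# Multiply each column (MARGIN = 2) by its own weight.
via_sweep <- sweep(counts, MARGIN = 2, STATS = w, FUN = "*")

# Both routes should give the same weighted matrix.
all.equal(via_transpose, via_sweep)
```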