I am working on a classification task using quanteda in R, and I want my models to consider some variables in addition to the bag of words. For instance, I computed dictionary-based sentiment indexes, and I'd like to include them as features.
These are the indexes I created for each document:
dfneg <- cbind(negDfm1@docvars$label, negDfm1@x, posDfm@x, angDfm@x, disgDfm1@x)
colnames(dfneg) <- c("label","neg" , "pos" , "ang" , "disg")
dfneg <- as.data.frame(dfneg)
This is the document-feature matrix I will work with:
newsdfm <- dfm(newscorp, tolower = TRUE, stem = FALSE, remove_punct = TRUE,
               remove = stopwords("english"), verbose = TRUE)
newst <- dfm_trim(newsdfm, min_docfreq = 2, verbose = TRUE)
id_train <- sample(1:6335, 5384, replace = FALSE)
# create docvar with ID
docvars(newst, "id_numeric") <- 1:ndoc(newst)
# get training set
train <- dfm_subset(newst, id_numeric %in% id_train)
# get test set (documents not in id_train)
test <- dfm_subset(newst, !id_numeric %in% id_train)
Finally, I run a classifier, for instance Naive Bayes or the lasso:
NBmodel <- textmodel_nb(train , train@docvars$label)
lasso <- cv.glmnet(train, train@docvars$label,
family="binomial", alpha=1, nfolds=10,
type.measure="class")
This is what I tried after creating the dfm, but it didn't work:
newsdfm@Dimnames$features$negz <- dfneg$neg
newsdfm@Dimnames$features$posz <- dfneg$pos
newsdfm@Dimnames$features$angz <- dfneg$ang
newsdfm@Dimnames$features$disgz <- dfneg$disg
Then I thought of creating document variables before creating newsdfm:
docvars(newscorp , "negz") <- dfneg$neg
docvars(newscorp , "posz") <- dfneg$pos
docvars(newscorp , "angz") <- dfneg$ang
docvars(newscorp , "disgz") <- dfneg$disg
But at that point, I don't know how to tell the classifier to consider these document variables in addition to the bag of words.
In summary, I expect the model to consider both the matrix of word counts for each document and the indexes I created for each document.
Any suggestion is highly appreciated.
Thank you in advance,
Carlo
Internally, a dfm is a sparse matrix, but it is better to avoid manipulating its slots directly if possible.
To make new features available to textmodel_nb(), you need to add them to the dfm as extra columns. As you might expect, the easiest way to do so is to call cbind() on the dfm.
In your example, you can run something like this:
# Matrix of extra features; its rows must be in the same order as the documents in newsdfm
additional_features <- as.matrix(dfneg[, c("neg", "pos", "ang", "disg")])
newsdfm_added <- cbind(newsdfm, additional_features)
As you can see, I first created a matrix of the additional features and then ran cbind(). When you execute cbind(), you will get the following warnings:
Warning messages:
1: cbinding dfms with different docnames
2: cbinding dfms with overlapping features will result in duplicated features
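As the second warning indicates, you have to make sure that the column names of the additional features do not already occur as features in the original dfm, for example by using names such as negz and posz, as in your own attempt. The first warning appears because the rows of the matrix do not carry the document names of the dfm; it should be harmless as long as the rows are in the same order as the documents.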
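For what it's worth, here is a minimal, self-contained sketch of the whole idea on a toy corpus. Everything in it is made up for illustration (the texts, the labels, the scores, and the names extra, x_added, and train_idx); it assumes a recent quanteda together with the quanteda.textmodels package, which is where textmodel_nb() lives in current versions. Giving the matrix the dfm's document names as row names and using column names that are not already features should avoid both warnings. The labels are kept in a separate vector rather than read from the docvars, since I would not rely on docvars surviving cbind().

```r
library(quanteda)
library(quanteda.textmodels)

# Toy data, purely illustrative
txt <- c(d1 = "good great film", d2 = "bad awful film",
         d3 = "great fun good",  d4 = "awful boring bad")
label <- c("pos", "neg", "pos", "neg")

x <- dfm(tokens(txt))

# Hypothetical per-document sentiment scores; row names match the
# document names of the dfm, column names do not clash with its features
extra <- matrix(c(0, 2, 0, 2,
                  2, 0, 2, 0),
                ncol = 2,
                dimnames = list(docnames(x), c("negz", "posz")))
stopifnot(!any(colnames(extra) %in% featnames(x)))

# Append the scores to the dfm as two extra feature columns
x_added <- cbind(x, extra)

# Split by row index so that features and labels stay aligned
train_idx <- c(1, 2)
nb <- textmodel_nb(x_added[train_idx, ], label[train_idx])
pred <- predict(nb, newdata = x_added[-train_idx, ])
```

One caveat: the default Naive Bayes model here is multinomial and treats every column as a count, so this presumably only makes sense if the indexes are non-negative. Since a dfm is a sparse matrix, the same x_added object should also be usable as the x argument of cv.glmnet().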