How to produce a document-term matrix in text2vec from only a stored list of words

What is the syntax in text2vec to vectorize texts and get a dtm containing only a given list of words?

How can I vectorize texts and produce a document-term matrix on only the indicated features? If a feature does not appear in the text, the corresponding variable should stay empty.

I need to produce document-term matrices with exactly the same columns as the dtm I trained the model on; otherwise I cannot apply the random forest model to new documents.

Solution

You can create a document-term matrix from only a specific set of features:

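library(text2vec)
# `it` is an itoken iterator over the new documents, e.g. itoken(new_texts, tokenizer = word_tokenizer)
v = create_vocabulary(c("word1", "word2"))
vectorizer = vocab_vectorizer(v)
dtm_test = create_dtm(it, vectorizer)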

However, I don't recommend 1) using a random forest on such sparse data, since it won't work well, or 2) performing feature selection the way you described, since you will likely overfit.
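
For the use case in the question (new documents must get the same columns as the training dtm), here is a minimal end-to-end sketch. It assumes the training texts are still available, so the vocabulary is rebuilt from them rather than from a stored word list. The toy texts and the names `train_texts`, `new_texts`, `it_train` and `it_new` are made up for illustration.

```r
library(text2vec)

# Hypothetical toy data; replace with your own training and new documents.
train_texts = c("the cat sat on the mat", "the dog chased the cat")
new_texts   = c("a dog sat on a log", "completely unrelated words")

# Build the vocabulary and the vectorizer from the training documents only.
it_train = itoken(train_texts, preprocessor = tolower,
                  tokenizer = word_tokenizer, progressbar = FALSE)
v = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(v)
dtm_train = create_dtm(it_train, vectorizer)

# Reuse the same vectorizer for the new documents: terms that are not in
# the vocabulary are dropped, and vocabulary terms that are absent from a
# document simply get a zero count.
it_new = itoken(new_texts, preprocessor = tolower,
                tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new = create_dtm(it_new, vectorizer)

# Both matrices now share the same columns in the same order.
identical(colnames(dtm_train), colnames(dtm_new))
```

If only the list of words was stored and not the training texts, pass that character vector to `create_vocabulary()` as in the answer above and build the vectorizer from it.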
