Tags: r, text-mining, text2vec

How to produce a document-term matrix in text2vec from a stored list of words only


What is the text2vec syntax for vectorizing texts into a DTM that contains only a given list of words?

How can I vectorize texts and produce a document-term matrix using only the indicated features? If a feature does not appear in the text, its column should stay empty.

I need to produce document-term matrices with exactly the same columns as the DTM I trained the model on; otherwise I cannot apply the random forest model to new documents.


Solution

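  • You can create a document-term matrix from a specific set of features only:

    library(text2vec)
    # Build the vocabulary directly from the stored terms
    v = create_vocabulary(c("word1", "word2"))
    vectorizer = vocab_vectorizer(v)
    # `it` is an itoken iterator over the new documents
    dtm_test = create_dtm(it, vectorizer)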
    

    However, I don't recommend 1) using random forest on such sparse data, as it won't work well, or 2) performing feature selection the way you described, as you will likely overfit.
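
To make the approach above concrete, here is a minimal end-to-end sketch. The documents, the variable names (`train_docs`, `new_docs`, `stored_terms`) and the use of `word_tokenizer` are assumptions for illustration only; they are not from the question. The stored term list is taken from the training DTM's column names here, but it could just as well be read from a file. Terms that do not occur in the new documents should simply come out as all-zero columns.

```r
library(text2vec)

# Hypothetical training and new documents (illustrative data only).
train_docs <- c("the cat sat on the mat", "the dog chased the cat")
new_docs   <- c("a dog sat quietly", "nothing relevant here")

# Build the training DTM the usual way.
it_train   <- itoken(train_docs, tokenizer = word_tokenizer, progressbar = FALSE)
vec_train  <- vocab_vectorizer(create_vocabulary(it_train))
dtm_train  <- create_dtm(it_train, vec_train)

# The stored list of words: the columns the model was trained on.
stored_terms <- colnames(dtm_train)

# Rebuild a vocabulary from the stored terms only and vectorize new documents.
v          <- create_vocabulary(stored_terms)
vectorizer <- vocab_vectorizer(v)
it_new     <- itoken(new_docs, tokenizer = word_tokenizer, progressbar = FALSE)
dtm_new    <- create_dtm(it_new, vectorizer)

# Defensive step: force the same column order as the training DTM.
dtm_new <- dtm_new[, stored_terms, drop = FALSE]
```

Words in the new documents that are not in `stored_terms` (such as "quietly") are ignored, so `dtm_new` can be passed to a model fitted on `dtm_train` without a column mismatch.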