Tags: r, text-mining, text2vec

How do I include stopwords (terms) in text2vec?


In the text2vec package, I am using the create_vocabulary function. For example, my text is "This book is very good", and suppose I am not using stopwords and use n-grams from 1L to 3L. The vocabulary terms will then be

This, book, is, very, good, This book, ..., book is very, very good. I just want to remove the term "book is very" (and a host of other terms, supplied as a vector). Since I want to remove a whole phrase, I can't use stopwords. I have written the code below:

x <- read.csv('Filename')  # file containing all the stop phrases
stp <- as.vector(x$term)   # stop phrases as a vector

vocab <- create_vocabulary(it, ngram = c(1L, 3L))
vocab_mod <- subset(vocab, !(term %in% stp))  # drop the stop phrases

When I do the above, the meta-information stored in the attributes gets lost in vocab_mod, so it can't be used in create_dtm.


Solution

  • @Dmitriy even this leads to the attributes being dropped. So the way out that I found, for now, was to add the attributes back manually using the attr function:

    attr(vocab_mod, "ngram") <- c(ngram_min = 1L, ngram_max = 3L), and so on for the other attributes as well. The attribute details can be read from vocab.
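Rather than retyping each attribute, the same workaround could be sketched as a small helper that copies every custom attribute from the original vocabulary onto the filtered one. This is only a sketch under assumptions: restore_attrs is a hypothetical helper name (not part of text2vec), and the vocabulary here is a mock data frame with hand-set attributes standing in for the output of create_vocabulary, so the snippet runs with base R alone. The underscore-joined terms are also an assumption about how n-grams are stored.

```r
# Hypothetical helper (not a text2vec function): copy all custom attributes
# from `original` to `filtered`, leaving the structural data-frame
# attributes (names, row.names, class) of `filtered` untouched.
restore_attrs <- function(filtered, original) {
  keep <- setdiff(names(attributes(original)),
                  c("names", "row.names", "class"))
  for (a in keep) {
    attr(filtered, a) <- attr(original, a)
  }
  filtered
}

# Mock vocabulary standing in for create_vocabulary() output (assumption).
vocab <- data.frame(
  term = c("book", "book_is_very", "very_good"),
  term_count = c(1L, 1L, 1L),
  doc_count = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)
attr(vocab, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)
attr(vocab, "document_count") <- 1L

stp <- c("book_is_very")  # stop phrases to remove

# Filter out the stop phrases, then put the meta-information back.
vocab_mod <- subset(vocab, !(term %in% stp))
vocab_mod <- restore_attrs(vocab_mod, vocab)
```

With a real vocabulary, the call would be the same: vocab_mod <- restore_attrs(vocab_mod, vocab) right after the subset step. Note that attributes describing the whole corpus are copied unchanged, which matches what setting them by hand with attr would do.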