In text2vec
package, I am using create_vocabulary function. For eg:
My text is "This book is very good" and suppose I am not using stopwords and an ngram of 1L to 3L. so the vocab terms will be
This, book, is, very, good, This book,..... book is very, very good. I just want to remove the term "book is very" (and host of other terms using a vector). Since I just want to remove a phrase I cant use stopwords. I have coded the below code:
vocab<-create_vocabulary(it,ngram=c(1L,3L))
vocab_mod<- subset(vocab,!(term %in% stp) # where stp is stop phrases.
x<- read.csv(Filename') #these are all stop phrases
stp<-as.vector(x$term)
When I do the above step, the metainformation in attributes get lost in vocab_mod and so can't be used in create_dtm
.
@Dmitriy even this lets to drop the attributes... So the way out that I found was just adding the attributes manually for now using attr function
attr(vocab_mod,"ngram")<-c(ngram_min = 1L,ngram_max=3L) and son one for other attributes as well. We can get attribute details from vocab.