I have two dfm objects and I would like to know which words are missing or different between them. For example:
library(quanteda)
df1 <- data.frame(Text = c("Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = FALSE)
corpus1 <- corpus(df1, text_field = "Text")
df2 <- data.frame(Text = c("Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = FALSE)
corpus2 <- corpus(df2, text_field = "Text")
dfm1 <- dfm(corpus1, remove_punct = TRUE)
dfm2 <- dfm(corpus2, remove_punct = TRUE)
I would like to see which words in dfm2 are not in dfm1. Thanks a lot for your help!
The question was how to compare the feature sets of two (quanteda) dfm objects, not how to reinvent a method for tokenizing the texts. Use setdiff() on the feature names:
> setdiff(featnames(dfm1), featnames(dfm2))
[1] "so" "stack" "immensely" "useful" "thank" "guys"
[7] "sort" "this" "out" "for"
This returns the features in dfm1 that are not in dfm2.
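Since setdiff() is directional, swapping the arguments gives the comparison in the other direction. The following is a minimal sketch using short stand-in texts (the names dfm_a, dfm_b and the texts themselves are hypothetical, not the ones from the question), and it builds each dfm from tokens() rather than directly from a corpus:

```r
library(quanteda)

# Hypothetical stand-in texts, kept short so the result is easy to check
dfm_a <- dfm(tokens("Stack is immensely useful", remove_punct = TRUE))
dfm_b <- dfm(tokens("Stack is great", remove_punct = TRUE))

# Features present in dfm_a but absent from dfm_b
only_in_a <- setdiff(featnames(dfm_a), featnames(dfm_b))

# Swap the arguments to get the other direction
only_in_b <- setdiff(featnames(dfm_b), featnames(dfm_a))

# Features the two objects have in common
shared <- intersect(featnames(dfm_a), featnames(dfm_b))

print(only_in_a)
print(only_in_b)
print(shared)
```

With the objects from the question, every word of the second text also occurs in the first, so setdiff(featnames(dfm2), featnames(dfm1)) should come back empty.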
@JBGruber's answer also works, but in the forthcoming v2 we deprecate the usage of dfm_select() where pattern is another dfm.
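If the counts of the differing features are needed rather than just their names, one possible approach is to pass the character vector from setdiff() as the pattern, which avoids supplying a dfm as pattern. This is only a sketch under that assumption; the texts and the names dfm_a, dfm_b and dfm_only_a are hypothetical:

```r
library(quanteda)

# Hypothetical stand-in texts; "useful" appears twice on purpose
dfm_a <- dfm(tokens("Stack is immensely useful useful", remove_punct = TRUE))
dfm_b <- dfm(tokens("Stack is great", remove_punct = TRUE))

# Names of the features that occur only in dfm_a
only_in_a <- setdiff(featnames(dfm_a), featnames(dfm_b))

# Keep just those features, matching the names literally
dfm_only_a <- dfm_select(dfm_a, pattern = only_in_a,
                         selection = "keep", valuetype = "fixed")

# Total count of each remaining feature
counts <- colSums(dfm_only_a)
print(counts)
```

Using valuetype = "fixed" here means the feature names are matched as literal strings rather than as glob patterns.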