r quanteda

How can I find which words differ between two dfms?


I have two dfms and I would like to know which words are present in one but missing from the other. For example:

library(quanteda)

df1 <- data.frame(Text = c("Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD. So Stack is immensely useful. Thank you guys to sort this out for me."), stringsAsFactors = FALSE)

corpus1 <- corpus(df1, text_field = "Text")

df2 <- data.frame(Text = c("Stackoverflow is a great place where very skilled data scientists are willing to help you. Trust me you will need help if you are doing a PhD."), stringsAsFactors = FALSE)
corpus2 <- corpus(df2, text_field = "Text")

dfm1 <- dfm(corpus1, remove_punct = TRUE)

dfm2 <- dfm(corpus2, remove_punct = TRUE)

I would like to see which words in dfm2 are not in dfm1. Thanks a lot for your help!


Solution

  • The question was how to compare the feature sets of two (quanteda) dfm objects, not to reinvent a method for tokenizing the texts.

    > setdiff(featnames(dfm1), featnames(dfm2))
     [1] "so"        "stack"     "immensely" "useful"    "thank"     "guys"     
     [7] "sort"      "this"      "out"       "for" 
    

    This returns the features in dfm1 that are not in dfm2.

    @JBGruber's answer also works, but in the forthcoming v2 we deprecate the use of dfm_select() where pattern is another dfm.
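
Since setdiff() is directional, it may help to see the comparison run both ways. The following is only a minimal sketch in base R: feats1 and feats2 are hypothetical stand-ins for a few of the values that featnames(dfm1) and featnames(dfm2) would return, hard-coded here so the example runs without quanteda installed.

```r
# Hypothetical stand-ins for featnames(dfm1) and featnames(dfm2);
# with quanteda loaded you would use those calls instead.
feats1 <- c("stackoverflow", "is", "a", "great",
            "so", "stack", "immensely", "useful")
feats2 <- c("stackoverflow", "is", "a", "great")

# Features in the first set that are absent from the second
only_in_1 <- setdiff(feats1, feats2)

# The reverse direction: features in the second set absent from the first
only_in_2 <- setdiff(feats2, feats1)

# Features shared by both sets
in_both <- intersect(feats1, feats2)

print(only_in_1)
print(only_in_2)
print(in_both)
```

In this sketch only_in_2 comes back empty because the second text is a strict subset of the first, which is also what setdiff(featnames(dfm2), featnames(dfm1)) should give for the example in the question.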