r quanteda

Detect ngrams that repeat the same word and remove them from a dfm


In a dfm, how can I detect ngrams that consist of the same word repeated, e.g.

hello_hello, text_text

and remove them from the dfm?


Solution

  • If the ngram elements in your dfm are joined by _, you can split each feature name on that character and check whether all of the resulting pieces are identical.

    library("quanteda")
    ## Package version: 2.1.2
    
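    # tokenize first: calling dfm() directly on a character vector is deprecated in quanteda >= 3
    dfmat <- dfm(tokens(c("test1_test1", "test1_test2", "test2_test2_test2", "test2_other", "other")))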
    
    featsplit <- strsplit(featnames(dfmat), "_")
    same <- vapply(featsplit, function(y) {
      length(y) >= 2 && # it's a compound (ngram)
        length(unique(y)) == 1 # all elements are the same
    }, logical(1))
    
    same
    ## [1]  TRUE FALSE  TRUE FALSE FALSE
    

    You can then use the logical vector same to keep only the features of the dfm that are not repeated-word ngrams:

    dfmat[, !same]
    ## Document-feature matrix of: 5 documents, 3 features (80.0% sparse).
    ##        features
    ## docs    test1_test2 test2_other other
    ##   text1           0           0     0
    ##   text2           1           0     0
    ##   text3           0           0     0
    ##   text4           0           1     0
    ##   text5           0           0     1
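
As an alternative to splitting, the same detection could be sketched with a single regular expression that uses a backreference. This is a sketch in base R only, not part of the original answer: the `feats` vector below stands in for `featnames(dfmat)`, and the name `same_rx` is made up for illustration.

```r
# Stand-in for featnames(dfmat) from the example above (assumption)
feats <- c("test1_test1", "test1_test2", "test2_test2_test2", "test2_other", "other")

# ^([^_]+)  captures the first element (one or more non-underscore characters)
# (_\\1)+$  requires one or more repeats of "_" followed by that same element
same_rx <- grepl("^([^_]+)(_\\1)+$", feats, perl = TRUE)

same_rx
## [1]  TRUE FALSE  TRUE FALSE FALSE
```

The result matches the split-based vector, so `dfmat[, !same_rx]` would select the same features when `feats` comes from `featnames(dfmat)`.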
    

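    If your ngram concatenator is a different character, substitute it for the _ in the strsplit() call. Note that strsplit() treats its split argument as a regular expression by default, so for a character such as . or + you should also pass fixed = TRUE.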
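
To make the concatenator easy to change, the split-and-compare logic could be wrapped in a small function. The sketch below is base R only; `is_repeated_ngram` and its `concatenator` argument are hypothetical names, not part of quanteda. It passes `fixed = TRUE` to `strsplit()` so the concatenator is matched literally rather than as a regular expression.

```r
# Hypothetical helper (not a quanteda function): flags feature names whose
# elements, after splitting on `concatenator`, are all identical.
is_repeated_ngram <- function(feats, concatenator = "_") {
  # fixed = TRUE matches the concatenator literally, so "." or "+" are safe
  parts <- strsplit(feats, concatenator, fixed = TRUE)
  vapply(parts, function(y) {
    length(y) >= 2 &&            # it's a compound (ngram)
      length(unique(y)) == 1     # all elements are the same
  }, logical(1))
}

is_repeated_ngram(c("hello_hello", "hello_world", "text"))
## [1]  TRUE FALSE FALSE

is_repeated_ngram(c("hello.hello", "hello.world"), concatenator = ".")
## [1]  TRUE FALSE
```

With a dfm, this could be applied as `dfmat[, !is_repeated_ngram(featnames(dfmat))]`, following the same selection pattern as above.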