
How to remove one- and two-character tokens using quanteda::tokens_select()


I am trying to remove one- and two-character tokens.

Here is an example:

library(quanteda)

toks <- tokens("This is a sentence. This is a second sentence.", remove_punct = TRUE)

toks <- tokens_select(toks, min_nchar=1L, max_nchar=2L, selection = "remove")

toks

Results:

tokens from 1 document. text1 :

[1] "is" "a" "is" "a"

I expected to get the tokens that do not meet the length criteria, not the ones that do.


Solution

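  • It looks like the selection argument is ignored when only min_nchar and max_nchar are given: the length limits seem to always describe the tokens to keep, not the ones to remove.

    Setting the limits to the lengths I want to keep, and running this on the original tokens, gives the results I wanted.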

    toks <- tokens_select(toks, min_nchar=3L, max_nchar=79L)
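
The workaround above can be sketched end to end as follows. This is a minimal example, assuming quanteda is installed and that the length limits behave as observed in the question (they pick the tokens to keep). The names toks_all, kept, and words are illustrative only, and max_nchar is simply left at its default here rather than set to 79L.

```r
library(quanteda)

# Tokenize once and keep the original object, so the filter
# is not applied to an already-filtered result.
toks_all <- tokens("This is a sentence. This is a second sentence.",
                   remove_punct = TRUE)

# Keep only tokens with at least 3 characters; this drops the
# one- and two-character tokens ("is", "a").
kept <- tokens_select(toks_all, min_nchar = 3L)

# Extract the tokens of the first (and only) document as a character vector.
as.list(kept)[[1]]

# Cross-check in base R: the same length filter on a plain character vector.
words <- as.list(toks_all)[[1]]
words[nchar(words) >= 3]
```

Storing the filtered result under a new name keeps the unfiltered tokens available, which avoids accidentally filtering an object that was already reduced to "is" and "a" by the earlier call.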