quanteda dfm

What is the new method for "remove_twitter" in dfm (Quanteda)?


I get the following message. Using R 3.6.3, RStudio 1.2.5042, and Quanteda 2.0.1.

corpus.dfm <- dfm(corpus, remove_twitter = TRUE)
'remove_twitter' is deprecated; for FALSE, use 'what = "word"' instead. 

I understand what deprecated means in context, but I don't understand the second part: use 'what = "word"' instead. Could an experienced user clarify, please?

Thank you.


Solution

  • I admit the deprecation message is not the most helpful. The idea is that we changed the default tokenizer behaviour in v2: what = "word" (the default) now preserves social media tags (@username and #hashtag), and it offers no option to remove the @ or # from the tags.

    To remove the tag symbols, either use what = "word1" (the pre-v2 default) or use any other tokenizer that returns a list, for instance tokenize_words() from the tokenizers package.

    library("quanteda")
    ## Package version: 2.0.1
    
    txt <- "This is a @username and #hashtag."
    
    # preserve social media tags (default)
    tokens(txt, remove_punct = TRUE, what = "word")
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "This"      "is"        "a"         "@username" "and"       "#hashtag"
    
    # strip the @ and # symbols from tags (using the tokenizers package)
    tokenizers::tokenize_words(txt, lowercase = FALSE) %>%
      tokens()
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "This"     "is"       "a"        "username" "and"      "hashtag"
    
    # strip the @ and # symbols from tags (using quanteda's pre-v2 "word1" tokenizer)
    tokens(txt,
      remove_twitter = TRUE, remove_punct = TRUE,
      what = "word1"
    )
    ## Warning: 'remove_twitter' is deprecated; for FALSE, use 'what = "word"' instead.
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "This"     "is"       "a"        "username" "and"      "hashtag"
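
    If neither tokenizer change is convenient, the symbols can also be stripped from the raw text before it reaches quanteda. The following is a minimal sketch in base R. Note that strip_tag_symbols() is a hypothetical helper, not a quanteda function, and the regular expression reflects an assumption about what counts as a tag: an @ or # at the start of a word.

    ```r
    # Hypothetical helper (not part of quanteda): drop a leading "@" or "#"
    # from social media tags in raw text before tokenizing.
    strip_tag_symbols <- function(x) {
      # Match "@" or "#" only when it is not preceded by a word character
      # and is followed by one, so "user@example.com" is left untouched.
      gsub("(?<!\\w)[@#](?=\\w)", "", x, perl = TRUE)
    }

    txt <- "This is a @username and #hashtag."
    clean <- strip_tag_symbols(txt)
    print(clean)
    ## [1] "This is a username and hashtag."

    # The cleaned text can then go through the default tokenizer as usual.
    # Guarded so the sketch also runs where quanteda is not installed.
    if (requireNamespace("quanteda", quietly = TRUE)) {
      toks <- quanteda::tokens(clean, remove_punct = TRUE)
      print(quanteda::dfm(toks))
    }
    ```

    Working on the raw text keeps the cleanup independent of the tokenizer, at the cost of losing the information that a token was ever a tag.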
    

    Update regarding quanteda >= v2

    The remove_twitter option has been removed in v2. The tokens documentation now states:

    In versions < 2, the argument remove_twitter controlled whether social media tags were preserved or removed, even when remove_punct = TRUE. This argument is no longer functional in versions >= 2. If greater control over social media tags is desired, you should use an alternative tokenizer, including non-quanteda options.

    So now, these symbols are always preserved by quanteda's default tokeniser:

    > tokens("This is a #hashtag and @username.")
    Tokens consisting of 1 document.
    text1 :
    [1] "This"      "is"        "a"         "#hashtag"  "and"       "@username" "."
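
    Since the default tokenizer keeps the tags intact, one possible way to drop only the symbols in v2 and later is to rewrite the token types after tokenizing. This is a sketch rather than an official recipe: it assumes types() and tokens_replace() can be combined the way the tokens_replace() documentation does for stemming, and the regular expression for a leading # or @ is an illustrative choice.

    ```r
    library("quanteda")

    toks <- tokens("This is a #hashtag and @username.", remove_punct = TRUE)

    # All unique token types in the object, e.g. "#hashtag" and "@username"
    type <- types(toks)

    # Replace each type with a copy that has a leading "#" or "@" removed.
    # Fixed, case-sensitive matching so every type maps to exactly itself.
    toks_clean <- tokens_replace(
      toks,
      pattern = type,
      replacement = sub("^[#@]", "", type),
      valuetype = "fixed",
      case_insensitive = FALSE
    )

    print(toks_clean)
    ```

    Calling dfm(toks_clean) afterwards would then count "hashtag" and "#hashtag" as the same feature.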