I get the following message using R 3.6.3, RStudio 1.2.5042, and quanteda 2.0.1:
corpus.dfm <- dfm(corpus, remove_twitter = TRUE)
'remove_twitter' is deprecated; for FALSE, use 'what = "word"' instead.
I understand what deprecated means in context, but I don't understand the second part: use 'what = "word"' instead. Could an experienced user clarify, please?
Thank you.
I admit the deprecation message is not the most helpful, but the idea is that we changed the default tokenizer behaviour in v2. what = "word" (the default) now preserves social media tags (@username and #hashtag), and it offers no option to remove the @ or # from the tags.
To remove the tag symbols, you need to use what = "word1" (the pre-v2 default) or any other tokenizer that creates a list output, for instance the word tokenizer from the tokenizers package.
library("quanteda")
## Package version: 2.0.1
txt <- "This is a @username and #hashtag."
# preserve social media tags (default)
tokens(txt, remove_punct = TRUE, what = "word")
## Tokens consisting of 1 document.
## text1 :
## [1] "This" "is" "a" "@username" "and" "#hashtag"
# remove the @ and # symbols from social media tags (using tokenizers pkg)
tokenizers::tokenize_words(txt, lowercase = FALSE) %>%
  tokens()
## Tokens consisting of 1 document.
## text1 :
## [1] "This" "is" "a" "username" "and" "hashtag"
# remove the @ and # symbols from social media tags (using quanteda)
tokens(txt,
  remove_twitter = TRUE, remove_punct = TRUE,
  what = "word1"
)
## Warning: 'remove_twitter' is deprecated; for FALSE, use 'what = "word"' instead.
## Tokens consisting of 1 document.
## text1 :
## [1] "This" "is" "a" "username" "and" "hashtag"
Update regarding quanteda >= v2
This option has been removed in v2. The tokens documentation now states:
In versions < 2, the argument remove_twitter controlled whether social media tags were preserved or removed, even when remove_punct = TRUE. This argument is no longer functional in versions >= 2. If greater control over social media tags is desired, you should use an alternative tokenizer, including non-quanteda options.
So now, these symbols are always preserved by quanteda's default tokeniser:
> tokens("This is a #hashtag and @username.")
Tokens consisting of 1 document.
text1 :
[1] "This" "is" "a" "#hashtag" "and" "@username" "."
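If you only want to drop the leading @ or # after tokenizing, one possible workaround is to post-process the tokens as plain character data. The following is a minimal base R sketch, not a quanteda feature: the vector toks is hard-coded to mirror the output above, and strip_tag_symbols is a hypothetical helper name introduced here for illustration.

```r
# Character vector standing in for the tokens of one document,
# copied from the printed output above (an assumption for illustration).
toks <- c("This", "is", "a", "#hashtag", "and", "@username", ".")

# Hypothetical helper (not part of quanteda): remove a single leading
# "@" or "#" from each token and leave all other tokens unchanged.
strip_tag_symbols <- function(x) {
  sub("^[@#]", "", x)
}

stripped <- strip_tag_symbols(toks)
print(stripped)
```

Because the pattern is anchored at the start of each token, symbols that appear elsewhere inside a token are left alone. Applying this to a real tokens object would mean converting it to character data first and back afterwards, which is not shown here.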