I am trying to implement quanteda on my corpus in R, but I am getting:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
duplicate row.names: character(0)
I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)
The processing that you're doing with tm is preparing a object for tm and quanteda doesn't know what to do with it...quanteda does all of these steps itself, help("dfm"), as can be seen from the options.
If you try the following you can move ahead:
dfm(tweets$Tweet, verbose = TRUE, toLower= TRUE, removeNumbers = TRUE, removePunct = TRUE,removeTwitter = TRUE, language = "english", ignoredFeatures=stopwords("english"), stem=TRUE)
Creating a dfm from a character vector ... ... lowercasing ... tokenizing ... indexing documents: 6,943 documents ... indexing features: 15,164 feature types ... removed 161 features, from 174 supplied (glob) feature types ... stemming features (English), trimmed 2175 feature variants ... created a 6943 x 12828 sparse dfm ... complete. Elapsed time: 0.756 seconds. HTH