I am trying to do sentiment analysis of Tweets. While doing the pre-processing of words and creating a matrix, I got the following error:
Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") :
missing value where TRUE/FALSE needed
Out of 14215 tweets, I narrowed it down to the one tweet that produces the error, but I have no clue how to prevent the error from happening again. Here is the tweet that triggers it, with code to reproduce the error:
library(RTextTools)
tweet <- "demonio leg edge sexy we get it u vape PLEASE COME TO NA SOON I HAVE A LUCIEL READY FOR U dominos"
all_tweets <- create_matrix(tweet, language = "english", minWordLength = 3,
                            removeStopwords = TRUE, removeNumbers = TRUE, # we can also removeSparseTerms
                            stemWords = TRUE, removePunctuation = TRUE, removeSparseTerms = 0)
First, I would like to understand the error and why it occurred. Second, I am looking for a way to prevent it from occurring again, either by detecting and removing such tweets or by changing my create_matrix call.
The error comes from executing
wordStem(
  c("demonio", "leg", "edge", "sexy",
    "get", "u", "vape", "please",
    "come", NA, "soon", "luciel",
    "ready", "u", "dominos")
)
# Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") :
# missing value where TRUE/FALSE needed
This may be a bug. The character string "NA" in the tweet ("COME TO NA SOON") seems to be tokenized into NA (a missing value), and wordStem cannot handle a missing value in its input.
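To illustrate why a missing value leads to this particular message, here is a minimal sketch in base R. It mimics the length check quoted in the error; the limit of 255 is an assumed value for illustration only, and the behaviour of nchar() on a missing string is, as far as I know, what R 3.3.0 and later do by default.

```r
# Token vector similar to what reaches the stemmer:
# the string "NA" has already become a missing value.
words <- c("come", NA, "soon")

# On R >= 3.3.0, nchar() returns NA for the missing element.
lens <- nchar(words)

# No element is TRUE and one is NA, so any() returns NA rather than FALSE.
# 255 is an assumed limit, used only for illustration.
over_limit <- any(lens > 255)

# `if (over_limit)` cannot decide between TRUE and FALSE and raises an error;
# tryCatch captures the message instead of aborting the script.
msg <- tryCatch(
  if (over_limit) "too long" else "ok",
  error = function(e) conditionMessage(e)
)
print(msg)
```

So the check itself is what fails, not the stemming: the condition evaluates to NA as soon as one token is missing.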
As a workaround, build the document-term matrix with tm directly:
library(tm)
all_tweets <- DocumentTermMatrix(
  Corpus(VectorSource(tweet)),
  control = list(
    wordLengths = c(3, Inf),
    stopwords = TRUE,
    removeNumbers = TRUE,
    stemming = TRUE,
    removePunctuation = TRUE
  )
)
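If you would rather keep create_matrix, another option is to find or neutralize the problematic token before building the matrix. The sketch below assumes that only a standalone upper-case "NA" is turned into a missing value, which I have not verified for every setting. has_na_token and mask_na_token are hypothetical helper names, not part of RTextTools or tm, and the second example tweet is made up.

```r
# Hypothetical helpers (not part of RTextTools or tm).
# Assumption: only the standalone upper-case token "NA" is read as a missing value.

# TRUE for texts that contain "NA" as a whole word.
has_na_token <- function(x) grepl("\\bNA\\b", x, perl = TRUE)

# Lower-case the standalone token so it stays an ordinary word.
mask_na_token <- function(x) gsub("\\bNA\\b", "na", x, perl = TRUE)

tweets <- c(
  "PLEASE COME TO NA SOON I HAVE A LUCIEL READY FOR U",
  "NASA launched a rocket"  # made-up tweet; "NA" here is only part of a word
)

# Option 1: drop the affected tweets.
kept_tweets <- tweets[!has_na_token(tweets)]

# Option 2: keep every tweet but rewrite the token.
clean_tweets <- mask_na_token(tweets)
```

Either kept_tweets or clean_tweets could then be passed to create_matrix in place of the original vector. With minWordLength = 3, the rewritten "na" would be dropped from the matrix anyway, so the second option should not change the resulting terms.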
My sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RTextTools_1.4.2 SparseM_1.7
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 splines_3.3.0 MASS_7.3-44 tau_0.0-18 prodlim_1.5.5 tm_0.6-2
[7] lattice_0.20-33 foreach_1.4.3 caTools_1.17.1 tools_3.3.0 nnet_7.3-11 parallel_3.3.0
[13] grid_3.3.0 ipred_0.9-5 glmnet_2.0-5 e1071_1.6-7 iterators_1.0.8 class_7.3-14
[19] survival_2.39-4 randomForest_4.6-12 Matrix_1.2-6 NLP_0.1-9 lava_1.4.3 bitops_1.0-6
[25] codetools_0.2-14 rsconnect_0.4.3 maxent_1.3.3.1 rpart_4.1-10 slam_0.1-32 tree_1.0-36