
Stemming Words in R: Missing Value


I am trying to do sentiment analysis of tweets. While pre-processing the words and creating a matrix, I got the following error:

Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") : 
missing value where TRUE/FALSE needed

Out of 14215 tweets, I narrowed it down to the one tweet that produces the error, but I have no idea how to prevent it from happening again. The tweet that triggers the error, together with code to reproduce it:

library(RTextTools)
tweet <- "demonio leg edge sexy we get it u vape PLEASE COME TO NA SOON I HAVE A LUCIEL READY FOR U dominos"
all_tweets <- create_matrix(tweet, language = "english", minWordLength = 3,
                            removeStopwords = TRUE, removeNumbers = TRUE,  # we can also removeSparseTerms
                            stemWords = TRUE, removePunctuation = TRUE, removeSparseTerms = 0)

First, I would like to understand the error: why did it occur? Second, I would like a way to prevent it from occurring again, either by detecting and removing such tweets or by changing my create_matrix call.


Solution

  • The error comes from executing

    wordStem(
      c("demonio", "leg", "edge", "sexy", 
      "get", "u", "vape", "please", 
      "come", NA, "soon", "luciel", 
      "ready", "u", "dominos")
    )
    # Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") : 
    #   missing value where TRUE/FALSE needed
    

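    This may be a bug: the character string "NA" in the tweet seems to be tokenized into NA (a missing value) instead of being kept as a word. With a missing value in the input, the length check any(lens > lim) evaluates to NA, and if cannot branch on NA, hence "missing value where TRUE/FALSE needed".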
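
To see why a single missing value is enough to break the check, here is a minimal base R sketch of the same pattern. It does not call the stemmer itself; the variable names and the limit of 255 are assumptions chosen for illustration, not values taken from the package.

```r
# A word vector with one missing value, as in the failing wordStem() call
words <- c("come", NA, "soon")

# In current R versions, nchar() returns NA for a missing string
lens <- nchar(words)

# Illustrative limit only; the real value of `lim` is not shown in the error
lim <- 255

# any() over FALSE, NA, FALSE is NA, not FALSE
check <- any(lens > lim)

# if (NA) is an error, which we capture here instead of stopping the script
msg <- tryCatch(
  if (check) "too long" else "ok",
  error = function(e) conditionMessage(e)
)
print(msg)

# A defensive variant: isTRUE() treats NA as "not TRUE"
safe <- isTRUE(check)
```

A check written with isTRUE(), or with any(lens > lim, na.rm = TRUE), would not stop on a missing value, although the NA would still be passed on to whatever comes next.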

    As a workaround, use

    library(tm)
    all_tweets <- DocumentTermMatrix(
      Corpus(VectorSource(tweet)), 
      control = list(
       wordLengths = c(3, Inf), 
       stopwords=TRUE, 
       removeNumbers=TRUE, 
       stemming=TRUE,
       removePunctuation = TRUE
      )
    )
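
If you would rather keep create_matrix and select the problematic tweets beforehand, one option is to flag tweets that contain "NA" as a standalone word. This is only a sketch: has_na_token is a hypothetical helper, the sample tweets are made up, and it assumes that the upper-case token "NA" is what triggers the error, which is not confirmed beyond the single tweet above.

```r
# Hypothetical helper: TRUE for tweets containing "NA" as a standalone word.
# Assumption: only the exact upper-case token "NA" is read as a missing value.
has_na_token <- function(tweets) {
  grepl("\\bNA\\b", tweets, perl = TRUE)
}

# Made-up examples for illustration
tweets <- c(
  "PLEASE COME TO NA SOON",
  "NASA launched a rocket",
  "nothing unusual here"
)

flagged <- has_na_token(tweets)

# Keep only the tweets that do not contain the suspicious token
clean_tweets <- tweets[!flagged]
print(clean_tweets)
```

Dropping tweets loses data, so inspect tweets[flagged] first to see how many are affected before deciding to remove them.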
    

    My sessionInfo():

    R version 3.3.0 (2016-05-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1
    
    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
    [4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] RTextTools_1.4.2 SparseM_1.7     
    
    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.5         splines_3.3.0       MASS_7.3-44         tau_0.0-18          prodlim_1.5.5       tm_0.6-2           
     [7] lattice_0.20-33     foreach_1.4.3       caTools_1.17.1      tools_3.3.0         nnet_7.3-11         parallel_3.3.0     
    [13] grid_3.3.0          ipred_0.9-5         glmnet_2.0-5        e1071_1.6-7         iterators_1.0.8     class_7.3-14       
    [19] survival_2.39-4     randomForest_4.6-12 Matrix_1.2-6        NLP_0.1-9           lava_1.4.3          bitops_1.0-6       
    [25] codetools_0.2-14    rsconnect_0.4.3     maxent_1.3.3.1      rpart_4.1-10        slam_0.1-32         tree_1.0-36