Tags: r, matrix, text-mining

How do I convert this corpus of words from an online book into a term document matrix?


Here is a snippet of my code:

    library(gutenbergr)
    library(tm)
    Alice <- gutenberg_download(c(11))
    Alice <- Corpus(VectorSource(Alice))
    cleanAlice <- tm_map(Alice, removeWords, stopwords('english'))
    cleanAlice <- tm_map(cleanAlice, removeWords, c('Alice'))
    cleanAlice <- tm_map(cleanAlice, tolower)
    cleanAlice <- tm_map(cleanAlice, removePunctuation)
    cleanAlice <- tm_map(cleanAlice, stripWhitespace)
    dtm1 <- TermDocumentMatrix(cleanAlice)
    dtm1

But then I receive the following error:

    <<TermDocumentMatrix (terms: 3271, documents: 2)>>
    Non-/sparse entries: 3271/3271
    Sparsity           : 50%
    Error in nchar(Terms(x), type = "chars") : 
      invalid multibyte string, element 12

How should I deal with this? Should I convert the corpus into a plain text document first? Is there something wrong with the text format of the book?


Solution

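  • gutenberg_download() returns a data.frame, not a character vector. Passing the whole data.frame to VectorSource() makes tm treat each column as one document, which is why your output reports only 2 documents. Point VectorSource() at the text column instead: replace VectorSource(Alice) with VectorSource(Alice$text), and the rest of your code should work fine.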

    library(gutenbergr)
    library(tm)
    
    # don't overwrite your download when you are testing
    Alice <- gutenberg_download(c(11))
    
    # specify the column in the data.frame
    Alice_corpus <- Corpus(VectorSource(Alice$text))
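    # note: stopwords('english') is all lowercase and removeWords is case-sensitive,
    # so capitalized stopwords such as 'The' are not removed by this step
    cleanAlice <- tm_map(Alice_corpus, removeWords, stopwords('english'))
    cleanAlice <- tm_map(cleanAlice, removeWords, c('Alice'))
    cleanAlice <- tm_map(cleanAlice, tolower)
    cleanAlice <- tm_map(cleanAlice, removePunctuation)
    cleanAlice <- tm_map(cleanAlice, stripWhitespace)
    # rows are terms, columns are documents (here, one document per line of text)
    dtm1 <- TermDocumentMatrix(cleanAlice)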
    dtm1
    
    <<TermDocumentMatrix (terms: 3293, documents: 3380)>>
    Non-/sparse entries: 13649/11116691
    Sparsity           : 100%
    Maximal term length: 46
    Weighting          : term frequency (tf)
    

    P.S. You can ignore the warning messages that this code prints.
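
To see why the original call ended up with only 2 documents, here is a minimal base-R sketch. It does not use gutenbergr or tm: `toy` is a made-up stand-in for the downloaded data.frame (three short lines instead of the whole book), and the coercion step is only an assumption about what happens when a whole data.frame is handed to VectorSource(), though it matches the "documents: 2" in the question's output.

```r
# Hypothetical stand-in for the data.frame returned by gutenberg_download():
# an id column plus a text column with one row per line of the book.
toy <- data.frame(
  gutenberg_id = c(11L, 11L, 11L),
  text = c("Alice was beginning to get very tired",
           "of sitting by her sister on the bank,",
           "and of having nothing to do"),
  stringsAsFactors = FALSE
)

# A data.frame is a list of columns, so its length is the number of columns.
length(toy)        # 2

# The text column is an ordinary character vector, one element per line.
length(toy$text)   # 3

# Coercing the whole data.frame to character gives one deparsed string
# per column, not one string per line of text.
whole <- as.character(toy)
length(whole)      # 2
```

With the real download, the same check is simply str(Alice) or length(Alice$text) before building the corpus.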