Tags: r, supervised-learning

RTextTools: Trouble creating a document-term matrix with create_matrix


I'm using RTextTools for the first time. Here's my call to create_matrix:

library(RTextTools)
texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.",
           "This is the third text.",
           "File is not this.")
doc_matrix <- create_matrix(texts, language="english", removeNumbers=FALSE,
                            stemWords=TRUE, removeSparseTerms=.2)

I'm getting the following error(s):

Error in `[.simple_triplet_matrix`(matrix, , sort(colnames(matrix))) : 
Invalid subscript type: NULL.
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(j) : is.na() applied to non-(list or vector) of type 'NULL'

I haven't seen anyone else post this error yet, and figure there's something very basic which I am missing.

Peter


Solution

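  • You need to remove the final argument, removeSparseTerms=.2. From the tm package documentation on removeSparseTerms: "A term-document matrix where those terms from x are removed which have at least a sparse percentage of empty (i.e., terms occurring 0 times in a document) elements. I.e., the resulting matrix contains only terms with a sparse factor of less than sparse."

    I think the sparseness threshold is too low for your data set. With a threshold of .2, a term is kept only if it is absent from fewer than 20% of the documents, which for five documents means it has to appear in all five. The only words that do ("this" and "is") are stop words, which create_matrix removes by default, so no term survives. The matrix is left with no columns, which is most likely why colnames(matrix) comes back NULL and the subscript error is raised.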
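
To see why a threshold of .2 leaves nothing behind for these five texts, here is a minimal base-R sketch of the rule quoted above. It does not call tm or RTextTools. The stop word list (stopwords_demo) and the helpers tokenize and kept_terms are hypothetical names made up for illustration, stemming is skipped, and the cutoff (keep a term if it occurs in more than n_docs * (1 - sparse) documents) is my reading of the documentation rather than code taken from the package.

```r
texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.",
           "This is the third text.",
           "File is not this.")

# Hypothetical stop word list for illustration only; the real English list is much longer.
stopwords_demo <- c("this", "is", "the", "a", "not")

# Lowercase, strip punctuation, split on whitespace, drop stop words.
# setdiff() also removes duplicates, so each term counts once per document.
tokenize <- function(x) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "[[:space:]]+")[[1]]
  setdiff(words, stopwords_demo)
}

# Number of documents in which each remaining term occurs.
doc_freq <- table(unlist(lapply(texts, tokenize)))

# Assumed rule: keep a term only if it occurs in more than
# n_docs * (1 - sparse) documents.
kept_terms <- function(sparse) {
  names(doc_freq)[as.vector(doc_freq) > length(texts) * (1 - sparse)]
}

kept_terms(0.2)  # no term occurs in all five documents, so nothing is kept
kept_terms(0.7)  # a looser threshold keeps terms that occur in two or more documents
```

Under these assumptions the .2 call returns an empty character vector, which mirrors the empty matrix behind the error. On a corpus this small, leaving removeSparseTerms out or using a value much closer to 1 seems the safer choice.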