r, text-mining

Problems with TermDocumentMatrix function in R


I'm trying to create a TermDocumentMatrix using the tm package, but I've run into difficulties.

The input:

trainDF<-as.matrix(list("I'm going home", "trying to fix this", "when I go home"))

Goal: create a TDM from the input (not all control parameters are listed below):

control <- list(
    weight= weightTfIdf, 
    removeNumbers=TRUE, 
    removeStopwords=TRUE, 
    removePunctuation=TRUE,    
    stemWords=TRUE, 
    maxWordLength=maxWordL,
    bounds=list(local=c(minDocFreq, maxDocFreq))
)

tdm<- TermDocumentMatrix(Corpus(DataframeSource(trainDF)),control = control)

The message I get (a warning rather than an error):

Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

And the tdm object is empty. Any ideas?


Solution

  • The error suggests something is wrong with your choice of minimum and maximum document frequency in the bounds. For example, the following works:

    control=list(weighting = weightTfIdf,
                 removeNumbers=TRUE, 
                 removeStopwords=TRUE, 
                 removePunctuation=TRUE, 
                 bounds=list(local=c(1,3)))
    tdm<- TermDocumentMatrix(Corpus(DataframeSource(trainDF)), control=control)
    

    Note that in recent versions of tm, you specify a weighting with weighting = weightTfIdf rather than weight = weightTfIdf. Similarly, use stemming = TRUE in your control list to stem words. I'm not sure that maxWordLength is currently an option. tm silently ignores invalid options in the control list, so you won't know that something is wrong until you go back and inspect the matrix.
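    Since invalid control entries are dropped silently, one way to catch them early is to compare the names in your control list against the option names you expect tm to understand. The sketch below is not part of tm. `known_options` is a hand-written list based on my reading of the `termFreq` and `TermDocumentMatrix` help pages, and `unknown_controls` is a hypothetical helper, so check `?termFreq` for your installed version before relying on it. The values standing in for maxWordL, minDocFreq and maxDocFreq are placeholders, since the question does not define them.

```r
# Option names assumed from ?termFreq and ?TermDocumentMatrix.
# This list is an assumption and may differ between tm versions.
known_options <- c("tokenize", "tolower", "language", "removePunctuation",
                   "removeNumbers", "stopwords", "stemming", "dictionary",
                   "bounds", "wordLengths", "weighting")

# Hypothetical helper: return the names in a control list that are not
# in known_options, in the order they appear in the list.
unknown_controls <- function(control, known = known_options) {
  setdiff(names(control), known)
}

# The control list from the question, with placeholder values.
control <- list(
  weight = "weightTfIdf",        # string stand-in so this runs without tm loaded
  removeNumbers = TRUE,
  removeStopwords = TRUE,
  removePunctuation = TRUE,
  stemWords = TRUE,
  maxWordLength = 10,            # placeholder for maxWordL
  bounds = list(local = c(1, 3)) # placeholders for minDocFreq, maxDocFreq
)

flagged <- unknown_controls(control)
print(flagged)
```

    By this list, removeStopwords is flagged as well as weight, stemWords and maxWordLength. As far as I can tell, the documented names are stopwords = TRUE for stopword removal and wordLengths = c(min, max) for word-length limits, but treat both as assumptions to verify against your version's help pages.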