
How to sample 75 percent of rows from a dtm?


How can I sample rows from a dtm (document-term matrix)? I have tried a lot of code, but it always returns the same error:

Error in dtm[splitter, ] : incorrect number of dimensions

This is the code:

n <- dtm$nrow
splitter <- sample(1:n, round(n * 0.75))
train_set <- dtm[splitter, ]
valid_set <- dtm[-splitter, ]

Solution

  • You can use the quanteda package for this. See example below:

    Create example data based on the crude data set from tm:

    library(tm)
    
    data("crude")
    crude <- as.VCorpus(crude)
    crude <- tm_map(crude, stripWhitespace)
    crude <- tm_map(crude, removePunctuation)
    crude <- tm_map(crude, content_transformer(tolower))
    crude <- tm_map(crude, removeWords, stopwords("english"))
    crude <- tm_map(crude, stemDocument)
    
    dtm <- DocumentTermMatrix(crude)
    
    
    library(quanteda)
    
    # Transform your dtm into a dfm for quanteda
    my_dfm <- as.dfm(dtm)
    
    # number of documents
    ndoc(my_dfm)
    [1] 20
    
    set.seed(4242)
    
    # create training set
    # (dfm_sample() samples documents by default; recent quanteda versions no longer accept a margin argument)
    train_set <- dfm_sample(my_dfm, 
                            size = round(ndoc(my_dfm) * 0.75))  # set sample size
    
    # create the test set by selecting the documents that are not in the training set
    test_set <- dfm_subset(my_dfm, !docnames(my_dfm) %in% docnames(train_set))
    
    # number of documents in train
    ndoc(train_set)
    [1] 15
    
    # number of documents in test
    ndoc(test_set)
    [1] 5
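    
    # Sanity check (an added sketch, not part of the original answer):
    # the two sets should share no documents and together cover the whole dfm.
    # Both expressions should evaluate to TRUE when sampling without replacement.
    length(intersect(docnames(train_set), docnames(test_set))) == 0
    ndoc(train_set) + ndoc(test_set) == ndoc(my_dfm)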
    

    Afterwards, you can use the quanteda function convert() to turn your train and test sets into the formats expected by topicmodels, lda, lsa, etc. See ?convert for more info.
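
    The conversion step could look roughly like the sketch below. It rebuilds the split from the raw crude data (skipping the cleaning steps to keep it short) and converts both sets with to = "topicmodels", which should give back tm-style DocumentTermMatrix objects. The names train_dtm and test_dtm are only illustrative, and the sketch assumes that tm and quanteda are both installed.

    ```r
    library(tm)
    library(quanteda)

    # Rebuild a small dfm from the crude data set that ships with tm
    data("crude")
    my_dfm <- as.dfm(DocumentTermMatrix(crude))

    # 75/25 split, as in the answer above
    set.seed(4242)
    train_set <- dfm_sample(my_dfm, size = round(ndoc(my_dfm) * 0.75))
    test_set  <- dfm_subset(my_dfm, !docnames(my_dfm) %in% docnames(train_set))

    # Convert each split to the format used by the topicmodels package
    # (train_dtm and test_dtm are hypothetical names chosen for this sketch)
    train_dtm <- convert(train_set, to = "topicmodels")
    test_dtm  <- convert(test_set,  to = "topicmodels")

    # Inspect the results
    class(train_dtm)
    dim(train_dtm)
    dim(test_dtm)
    ```

    Other targets such as "lda" or "lsa" are selected the same way through the to argument; ?convert lists what your installed version supports.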