Tags: r, performance, text-mining, tm

Removing stopwords and tolower() are slow on a Corpus in R


I have a corpus with roughly 75 MB of data. I am trying to run the following commands:

doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

These two calls alone take at least 40 minutes to run. I am looking to speed up the process, since I build a term-document matrix (TDM) from this corpus for my model.

I have tried calling gc() and memory.limit(10000000) frequently, but neither speeds things up.

My system has 4 GB of RAM, and I read the input data from a local database.

Hoping for suggestions to speed this up!


Solution

  • Maybe you can give quanteda a try:

    library(stringi)
    library(tm)
    library(quanteda)
    
    txt <- stri_rand_lipsum(100000L)
    print(object.size(txt), units = "Mb")
    # 63.4 Mb
    
    # quanteda: lowercase and drop stopwords while building the dfm
    system.time(
      dfm <- dfm(txt, toLower = TRUE, ignoredFeatures = stopwords("en"))
    )
    # Elapsed time: 12.3 seconds.
    #        user      system     elapsed
    #       11.61        0.36       12.30
    
    # tm: the same preprocessing via DocumentTermMatrix, for comparison
    system.time(
      dtm <- DocumentTermMatrix(
        Corpus(VectorSource(txt)),
        control = list(tolower = TRUE, stopwords = stopwords("en"))
      )
    )
    #    user      system     elapsed
    #  157.16        0.38      158.69
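
    Note that the dfm() call above uses quanteda's old API (pre-1.0); in
    current releases the toLower and ignoredFeatures arguments are gone.
    Here is a minimal sketch of the equivalent pipeline, assuming
    quanteda >= 1.0: tokenize first, remove the stopwords from the tokens
    object, then build the document-feature matrix.

    library(quanteda)
    library(stringi)

    txt <- stri_rand_lipsum(100000L)

    # Tokenize the whole character vector in one pass, then drop the
    # English stopwords from the tokens object.
    toks <- tokens_remove(tokens(txt), stopwords("en"))

    # Build the document-feature matrix; dfm() lowercases by default.
    mat <- dfm(toks)

    Either way, the speedup over tm comes largely from quanteda
    processing the whole character vector with vectorized stringi calls,
    whereas tm_map() applies each transformation to the corpus one
    document at a time.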