(Fast) word frequency matrix in R

I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm and wfdf functions from the qdap package, but have noticed that this is a bit slow for my needs. It appears that the production of the word-frequency matrix is the bottleneck.

The code for my function is as follows.

liwcr <- function(inputText, dict) {
  # Fail early if the dictionary path is wrong
  if (!file.exists(dict))
    stop("Dictionary file does not exist.")

  # Read in dictionary categories
  # Start by figuring out where the category list begins and ends
  dictionaryText <- readLines(dict)
  if(!length(grep("%", dictionaryText))==2)
    stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")

  catStart <- grep("%", dictionaryText)[1]
  catStop <- grep("%", dictionaryText)[2]
  dictLength <- length(dictionaryText)

  dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))

  wordCount <- word_count(inputText)

  outputFrame <- dictionaryCategories
  outputFrame["count"] <- 0

  # Now read in dictionary words

  no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
  dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop, nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)

  workingMatrix <- wfdf(inputText)
  for (i in workingMatrix[,1]) {
    if (i %in% dictionaryWords[, 1]) {
      occurrences <- 0
      foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
      foundCategories <- foundWord[1,2:no_col]
      for (w in foundCategories) {
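        # Skip empty category slots
        if (!is.na(w) && w != "") {
          existingCount <- outputFrame[outputFrame$V1 == w,]$count
          outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
        }
      }
    }
  }
  outputFrame  # assumed return value: the per-category counts
}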

I realize the for loop is inefficient, so in an effort to locate the bottleneck I tested the code without this portion (simply reading in each text file and producing the word-frequency matrix) and saw very little speed improvement. Example:

fn <- reports::folder(delete_me)
n <- 10000

lapply(1:n, function(i) {
    out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
    cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})

filename <- sprintf("tweet%s.txt", 1:n)

for(i in seq_along(filename)){
  text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
  freq <- wfm(text)
}

The input files are Twitter and Facebook status postings.

Is there any way to improve the speed for this code?

EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.


  • Here is a qdap approach and a faster mixed qdap/tm approach. I provide the code and then the timings for each. Basically, I read everything in at once and operate on the entire data set. You could then split it back apart if you wanted with split.
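
    As a rough illustration of that last step, here is a minimal base R sketch of split. The data frame is a made-up stand-in for the combined data set (the column names text and tweet match the ones used further down; the values are hypothetical).

    ```r
    # Toy stand-in for the combined data set: one row per line of text,
    # plus a column naming the file it came from (hypothetical values)
    dat <- data.frame(
      text  = c("good morning all", "coffee time", "good night"),
      tweet = c("tweet1", "tweet2", "tweet1"),
      stringsAsFactors = FALSE
    )

    # split() returns a named list with one character vector per file
    by_file <- split(dat$text, dat$tweet)
    by_file$tweet1
    ```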

    A MWE that you should provide with questions

    fn <- reports::folder(delete_me)
    n <- 10000
    lapply(1:n, function(i) {
        out <- paste(sample(key.syl[[1]], 30, TRUE), collapse = " ")
        cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
    })
    filename <- sprintf("tweet%s.txt", 1:n)

    The qdap approach

    tic <- Sys.time() ## time it
    dat <- list2df(setNames(lapply(filename, function(x){
        readLines(file.path(fn, x))
    }), tools::file_path_sans_ext(filename)), "text", "tweet")
    difftime(Sys.time(), tic) ## time to read in
    the_wfm <- with(dat, wfm(text, tweet))
    difftime(Sys.time(), tic)  ## time to make wfm

    Timing qdap approach

    > tic <- Sys.time() ## time it
    > dat <- list2df(setNames(lapply(filename, function(x){
    +     readLines(file.path(fn, x))
    + }), tools::file_path_sans_ext(filename)), "text", "tweet")
    There were 50 or more warnings (use warnings() to see the first 50)
    > difftime(Sys.time(), tic) ## time to read in
    Time difference of 2.97617 secs
    > the_wfm <- with(dat, wfm(text, tweet))
    > difftime(Sys.time(), tic)  ## time to make wfm
    Time difference of 48.9238 secs

    The qdap-tm combined approach

    tic <- Sys.time() ## time it
    dat <- list2df(setNames(lapply(filename, function(x){
        readLines(file.path(fn, x))
    }), tools::file_path_sans_ext(filename)), "text", "tweet")
    difftime(Sys.time(), tic) ## time to read in
    tweet_corpus <- with(dat, as.Corpus(text, tweet))
    tdm <- tm::TermDocumentMatrix(tweet_corpus,
        control = list(removePunctuation = TRUE,
        stopwords = FALSE))
    difftime(Sys.time(), tic)  ## time to make TermDocumentMatrix

    Timing qdap-tm combined approach

    > tic <- Sys.time() ## time it
    > dat <- list2df(setNames(lapply(filename, function(x){
    +     readLines(file.path(fn, x))
    + }), tools::file_path_sans_ext(filename)), "text", "tweet")
    There were 50 or more warnings (use warnings() to see the first 50)
    > difftime(Sys.time(), tic) ## time to read in
    Time difference of 3.108177 secs
    > tweet_corpus <- with(dat, as.Corpus(text, tweet))
    > tdm <- tm::TermDocumentMatrix(tweet_corpus,
    +     control = list(removePunctuation = TRUE,
    +     stopwords = FALSE))
    > difftime(Sys.time(), tic)  ## time to make TermDocumentMatrix
    Time difference of 13.52377 secs

    There is a qdap-tm Package Compatibility vignette to help users move between qdap and tm. As you can see, on 10,000 tweets the combined approach is ~3.5x faster (about 13.5 s versus 48.9 s, reading included). A purely tm approach may be faster still. Also, if you want the wfm, use as.wfm(tdm) to coerce the TermDocumentMatrix.
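
    To show what "operate on the entire data set" means without either package, here is a minimal base R sketch that builds a term-by-document count table in one pass. It is not how qdap or tm do it internally, and the docs vector is a hypothetical stand-in for the text read from disk.

    ```r
    # Hypothetical input: a named character vector, one element per document
    docs <- c(tweet1 = "The cat sat. The cat ran!",
              tweet2 = "A dog sat",
              tweet3 = "")

    # Lower-case, strip punctuation, and split on whitespace for all documents at once
    clean  <- gsub("[[:punct:]]", "", tolower(docs))
    tokens <- strsplit(trimws(clean), "[[:space:]]+")

    # One long vector of words, plus a parallel factor recording the source document
    words <- unlist(tokens, use.names = FALSE)
    doc   <- factor(rep(names(docs), lengths(tokens)), levels = names(docs))

    # A single table() call yields the full term-by-document count matrix
    term_doc <- table(term = words, doc = doc)
    term_doc["cat", "tweet1"]
    ```

    The result here is a dense table, which could get large on a corpus of the size described in the question; tm keeps its TermDocumentMatrix in a sparse format, so treat this only as a sketch of the idea.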

    Your code, though, is slower either way because it's not the R way to do things: it loops over files and words one at a time instead of working on whole vectors. I'd recommend reading up on R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R, which I'd recommend.
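
    As one example of what a vectorized version could look like, here is a sketch of the dictionary lookup from the question written with match and tapply instead of nested for loops. The three data frames are small made-up stand-ins shaped like workingMatrix, dictionaryWords and outputFrame; I am assuming category codes sit in columns 2 onward, as in the question.

    ```r
    # Hypothetical stand-ins shaped like the objects in the question
    workingMatrix <- data.frame(Words = c("happy", "sad", "table", "cry"),
                                all   = c(3, 1, 5, 2),
                                stringsAsFactors = FALSE)
    dictionaryWords <- data.frame(X1 = c("happy", "sad", "cry"),
                                  X2 = c(1, 2, 2),
                                  X3 = c(3, 3, NA),
                                  stringsAsFactors = FALSE)
    outputFrame <- data.frame(V1 = 1:3,
                              V2 = c("posemo", "negemo", "affect"),
                              count = 0)

    # Look up every word's dictionary row in one call (NA when the word is absent)
    hit  <- match(workingMatrix$Words, dictionaryWords$X1)
    keep <- !is.na(hit)

    # Category codes for the matched words: one row per word, one column per slot
    cats <- as.matrix(dictionaryWords[hit[keep], -1, drop = FALSE])

    # Repeat each word's frequency across its category slots (column-major order),
    # drop empty slots, then sum the frequencies by category code
    freq   <- rep(workingMatrix$all[keep], times = ncol(cats))
    valid  <- !is.na(cats) & cats != ""
    totals <- tapply(freq[valid], cats[valid], sum)

    # Write the totals back, ignoring codes that are not in the category list
    idx <- match(names(totals), outputFrame$V1)
    ok  <- !is.na(idx)
    outputFrame$count[idx[ok]] <- as.vector(totals[ok])
    outputFrame
    ```

    Like the loop, match takes the first dictionary row for a word, so the counts should agree with the original code on the same input.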