I am writing an R program that involves analyzing a large amount of unstructured text data and creating a word-frequency matrix. I've been using the wfm
and wfdf
functions from the qdap
package, but have noticed that this is a bit slow for my needs. It appears that the production of the word-frequency matrix is the bottleneck.
The code for my function is as follows.
library(qdap)
liwcr <- function(inputText, dict) {
if(!file.exists(dict))
stop("Dictionary file does not exist.")
# Read in dictionary categories
# Start by figuring out where the category list begins and ends
dictionaryText <- readLines(dict)
if(!length(grep("%", dictionaryText))==2)
stop("Dictionary is not properly formatted. Make sure category list is correctly partitioned (using '%').")
catStart <- grep("%", dictionaryText)[1]
catStop <- grep("%", dictionaryText)[2]
dictLength <- length(dictionaryText)
dictionaryCategories <- read.table(dict, header=F, sep="\t", skip=catStart, nrows=(catStop-2))
wordCount <- word_count(inputText)
outputFrame <- dictionaryCategories
outputFrame["count"] <- 0
# Now read in dictionary words
no_col <- max(count.fields(dict, sep = "\t"), na.rm=T)
dictionaryWords <- read.table(dict, header=F, sep="\t", skip=catStop, nrows=(dictLength-catStop), fill=TRUE, quote="\"", col.names=1:no_col)
workingMatrix <- wfdf(inputText)
for (i in workingMatrix[,1]) {
if (i %in% dictionaryWords[, 1]) {
occurrences <- 0
foundWord <- dictionaryWords[dictionaryWords$X1 == i,]
foundCategories <- foundWord[1,2:no_col]
for (w in foundCategories) {
if (!is.na(w) & (!w=="")) {
existingCount <- outputFrame[outputFrame$V1 == w,]$count
outputFrame[outputFrame$V1 == w,]$count <- existingCount + workingMatrix[workingMatrix$Words == i,]$all
}
}
}
}
return(outputFrame)
}
I realize the for loop is inefficient, so in an effort to locate the bottleneck, I tested it without this portion of the code (simply reading in each text file and producing the word-frequency matrix), and seen very little in the way of speed improvements. Example:
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
for(i in 1:length(filename)){
print(filename[i])
text <- readLines(paste0("/toshi/twitter_en/", filename[i]))
freq <- wfm(text)
}
The input files are Twitter and Facebook status postings.
Is there any way to improve the speed for this code?
EDIT2: Due to institutional restrictions, I can't post any of the raw data. However, just to give an idea of what I'm dealing with: 25k text files, each with all the available tweets from an individual Twitter user. There are also an additional 100k files with Facebook status updates, structured in the same way.
Here is a qdap
approach and a mixed qdap/tm
approach that is faster. I provide the code and then the timings on each. Basically I read everything in at once and operator on the entire data set. You could then split it back apart if you wanted with split
.
A MWE that you should provide with questions
library(qdap)
fn <- reports::folder(delete_me)
n <- 10000
lapply(1:n, function(i) {
out <- paste(sample(key.syl[[1]], 30, T), collapse = " ")
cat(out, file=file.path(fn, sprintf("tweet%s.txt", i)))
})
filename <- sprintf("tweet%s.txt", 1:n)
The qdap approach
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
the_wfm <- with(dat, wfm(text, tweet))
difftime(Sys.time(), tic) ## time to make wfm
Timing qdap approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 2.97617 secs
>
> the_wfm <- with(dat, wfm(text, tweet))
>
> difftime(Sys.time(), tic) ## time to make wfm
Time difference of 48.9238 secs
The qdap-tm combined approach
tic <- Sys.time() ## time it
dat <- list2df(setNames(lapply(filename, function(x){
readLines(file.path(fn, x))
}), tools::file_path_sans_ext(filename)), "text", "tweet")
difftime(Sys.time(), tic) ## time to read in
tweet_corpus <- with(dat, as.Corpus(text, tweet))
tdm <- tm::TermDocumentMatrix(tweet_corpus,
control = list(removePunctuation = TRUE,
stopwords = FALSE))
difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Timing qdap-tm combined approach
> tic <- Sys.time() ## time it
>
> dat <- list2df(setNames(lapply(filename, function(x){
+ readLines(file.path(fn, x))
+ }), tools::file_path_sans_ext(filename)), "text", "tweet")
There were 50 or more warnings (use warnings() to see the first 50)
>
> difftime(Sys.time(), tic) ## time to read in
Time difference of 3.108177 secs
>
>
> tweet_corpus <- with(dat, as.Corpus(text, tweet))
>
> tdm <- tm::TermDocumentMatrix(tweet_corpus,
+ control = list(removePunctuation = TRUE,
+ stopwords = FALSE))
>
> difftime(Sys.time(), tic) ## time to make TermDocumentMatrix
Time difference of 13.52377 secs
There is a qdap-tm Package Compatibility (-CLICK HERE-) to help users move between qdap and tm. As you can see on 10000 tweets the combined approach is ~3.5 x faster. A purely tm
approach may be faster still. Also if you want the wfm
use as.wfm(tdm)
to coerce the TermDocumentMatrix
.
Your code though is slower either way because it's not the R way to do things. I'd recommend reading some additional info on R to get better at writing faster code. I'm currently working through Hadley Wickham's Advanced R that I'd recommend.