I am trying to remove a list of user-defined stop words from a Corpus. I am not sure what is going wrong: I have removed all the special characters from the stop words list and have already cleaned the text in the corpus. Any help would be greatly appreciated. The code and error messages are below. The CSV with the user-defined stop words is linked here: Stop Words
myCorpus <- Corpus(VectorSource(c("blank", "blank", "blank", "blank", "blank", "blank", "blank",
"blank", "blank", "blank", "blank", "blank", "blank", "<br />Key skills:<br />Octopus Deploy, MS Build, PowerShell, Azure, NuGet, CI / CD concepts, release management<br /><br /> * Minimum 5 years plus relevant experience in Application Development lifecycle, Automation and Release and Configuration Management<br /> * Considerable experience in the following disciplines - TFS (Team Foundation Server), DevOps, Continuous Delivery, Release Engineering, Application Architect, Database Architect, Information Modeling, Service Oriented Architecture (SOA), Quality Assurance, Branch Management, Network setup and troubleshooting, Server setup, configuration, maintenance and patching<br /> * Solid understanding of Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery<br /> * Solid understanding and experience working with high availability and high performance, multi-data center systems and hybrid cloud environments.<br /> * Proficient with Agile methodologies and working closely within small teams and vendors<br /> * Knowledge of Deployment and configuration automation platforms<br /> * Extensive PowerShell experience<br /> * Extensive knowledge of Windows based systems including hardware, software and .NET applications<br /> * Strong ability to troubleshoot complex issues ranging from system resources to application stack traces<br /><br />REQUIRED SKILLS:<br />Bachelor's degree & 5-10 years of relevant work experience.",
"blank")))
for (j in seq(myCorpus)) {
  # strip HTML tags; "<.*>" is greedy and would delete everything between
  # the first "<" and the last ">", so match one tag at a time instead
  myCorpus[[j]] <- gsub("<[^>]*>", " ", myCorpus[[j]])
  myCorpus[[j]] <- gsub("\\b[[:alnum:]]{20,}\\b", " ", myCorpus[[j]], perl = TRUE)  # drop tokens of 20+ characters
  myCorpus[[j]] <- gsub("[[:punct:]]", " ", myCorpus[[j]])  # replace punctuation with spaces
}
#Clean Corpus
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, stripWhitespace)
#User-defined stop words
manualStopwords <- read.csv("r_stop.csv", header = TRUE)
myStopwords <- paste(manualStopwords[, 1])
# translate "+" and "$" before stripping punctuation; stripping first would
# remove both characters, so these substitutions would never match
myStopwords <- gsub("\\+", "plus", myStopwords)
myStopwords <- gsub("\\$", "dollars", myStopwords)
myStopwords <- str_replace_all(myStopwords, "[[:punct:]]", "")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
First Error
Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : invalid regular expression '(*UCP)\b(zimmermann|yrs|yr|youve|... [the rest of the stop words]
Additional Warning
In addition: Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''
I was able to break my stop words up into smaller buckets and the code ran. The problem was not memory in general: the warning says the compiled regular expression was too large, because removeWords joins every stop word into one huge alternation pattern and PCRE caps how big a compiled pattern can be.
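To see how large that single pattern gets, you can roughly rebuild it from the sprintf() call quoted in the error message (a sketch only; myStopwords here is the full, unchunked vector):

# Approximate the one big pattern removeWords tries to compile
bigPattern <- sprintf("(*UCP)\\b(%s)\\b",
                      paste(sort(myStopwords, decreasing = TRUE), collapse = "|"))
nchar(bigPattern)  # with thousands of stop words this grows far past what PCRE will compile

Splitting the words into buckets keeps each compiled pattern under that limit: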
chunk <- 500                                        # maximum words per removeWords call
n <- length(myStopwords)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]   # group index for each word: 1,1,...,2,2,...
d <- split(myStopwords, r)                          # list of chunks, each at most 500 words
for (i in seq_along(d)) {
  myCorpus <- tm_map(myCorpus, removeWords, d[[i]])
}
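As an aside, the same chunking can be written more compactly by numbering the words and dividing by the chunk size (an equivalent sketch, reusing the chunk value above):

# Equivalent grouping: word i goes into bucket ceiling(i / chunk)
groups <- split(myStopwords, ceiling(seq_along(myStopwords) / chunk))
for (g in groups) {
  myCorpus <- tm_map(myCorpus, removeWords, g)
}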