Remove meaningless words from corpus in R


I am using tm and wordcloud to do some basic text mining in R. The text being processed contains many meaningless words such as asfdg and aawptkr, and I need to filter them out. The closest solution I have found is to use library(qdapDictionaries) and build a custom function that checks whether a word is valid.

library(qdapDictionaries)
is.word  <- function(x) x %in% GradyAugmented

# example
> is.word("aapg")
[1] FALSE

The rest of the text mining code is:

library(tm)

curDir <- "E:/folder1/"  # folder1 contains a.txt, b.txt
myCorpus <- VCorpus(DirSource(curDir))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

myCorpus <- tm_map(myCorpus, foo) # foo should clear meaningless words from the corpus

The issue is that is.word() works fine on data frames, but how can I use it on a corpus?

Thanks


Solution

  • Not sure if this is the most resource-efficient method (I don't know the package very well), but it should work:

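    # Build a term-document matrix to collect every token in the corpus
    tdm <- TermDocumentMatrix(myCorpus)
    all_tokens       <- findFreqTerms(tdm, 1)
    # Tokens not found in the dictionary are treated as meaningless
    tokens_to_remove <- setdiff(all_tokens, GradyAugmented)
    myCorpus <- tm_map(myCorpus, content_transformer(removeWords),
                       tokens_to_remove)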
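
As a possible alternative to removing the unwanted tokens, the dictionary check from the question can be turned around so that only valid words are kept. The sketch below is one way this might look, not a tested recipe for any particular corpus. The helper name keep_valid_words, the stand-in dictionary demo_dict and the corpus demoCorpus are made up for illustration, and the lower-case comparison assumes the dictionary entries are lower case.

```r
# Keep only the tokens of `text` that appear in `dictionary`.
# `keep_valid_words` is a hypothetical helper, not part of tm or qdapDictionaries.
keep_valid_words <- function(text, dictionary) {
  vapply(text, function(line) {
    # Split each line on whitespace and drop empty pieces
    tokens <- unlist(strsplit(line, "[[:space:]]+"))
    tokens <- tokens[nzchar(tokens)]
    # Compare in lower case (assumes the dictionary entries are lower case)
    paste(tokens[tolower(tokens) %in% dictionary], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Small stand-in dictionary so the helper can be tried without any packages
demo_dict <- c("the", "cat", "sat")
keep_valid_words("The asfdg cat aawptkr sat", demo_dict)
# [1] "The cat sat"

# With tm and qdapDictionaries installed, the helper can be passed to tm_map()
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("qdapDictionaries", quietly = TRUE)) {
  library(tm)
  library(qdapDictionaries)
  demoCorpus <- VCorpus(VectorSource("the asfdg cat aawptkr sat"))
  # Extra arguments to tm_map() are forwarded to the transformer
  demoCorpus <- tm_map(demoCorpus,
                       content_transformer(keep_valid_words),
                       dictionary = GradyAugmented)
  print(content(demoCorpus[[1]]))
}
```

In the question's pipeline, the same call would take the place of the foo step, applied to myCorpus after removePunctuation and removeNumbers. Since this version only does a lookup per token, it does not need a term-document matrix to be built first.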