Tags: r, text-mining, data-cleaning, sentiment-analysis

Is there an R function to clean text using a custom dictionary?


I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character list, and I want the content of my data (a VCorpus) to consist only of words that are in my dictionary.
For example:

#[1] "never give up uouo cbbuk jeez"  

would become

#[1] "never give up"  

as the words "never", "give", and "up" are all in the custom dictionary. I have previously tried the following:

    # Wrap the dictionary lookup in a function
    english.words  <- function(x) x %in% custom.dictionary
    # Filter based on words in the dictionary
    DF2 <- DF1[(english.words(DF1$Text)),]

but my result is a character list with one word. Any advice?


Solution

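  • Your attempt tests each whole sentence against the dictionary, so it keeps only rows whose entire text is a single dictionary word. Instead, split each sentence into words, keep only the words that are in your dictionary, and paste them back together into one sentence.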

    DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x) 
                        paste0(Filter(english.words, x), collapse = ' '))
    

    Here I have created a new column called Text1 that contains only the English words. If you want to replace the original column instead, save the output in DF1$Text.
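
    For reference, here is a self-contained sketch of the same split, filter, and paste approach that you can run as-is. The three-word custom.dictionary, the two-row DF1, and the helper name keep_dictionary_words are made up for illustration; swap in your own objects.

```r
# Toy stand-ins for the real objects (assumptions, not the asker's data)
custom.dictionary <- c("never", "give", "up")
DF1 <- data.frame(
  Text = c("never give up uouo cbbuk jeez", "cbbuk give"),
  stringsAsFactors = FALSE
)

# TRUE for each element of x that appears in the dictionary
english.words <- function(x) x %in% custom.dictionary

# Hypothetical helper: split each string on whitespace, drop words that
# are not in the dictionary, and glue the survivors back together
keep_dictionary_words <- function(text) {
  words <- strsplit(text, "\\s+")
  vapply(
    words,
    function(w) paste(w[english.words(w)], collapse = " "),
    character(1)
  )
}

DF1$Text1 <- keep_dictionary_words(DF1$Text)
print(DF1$Text1)
```

    Note that %in% is case-sensitive, so "Never" would be dropped unless you lower-case the text (and the dictionary) first, for example with tolower(). A string with no dictionary words at all comes back as an empty string rather than NA.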