I would like to use a custom dictionary (upwards of 400,000 words) when cleaning my data in R. I already have the dictionary loaded as a large character vector, and I am trying to make it so that the text in my data (a VCorpus) comprises only the words in my dictionary.
For example:
#[1] "never give up uouo cbbuk jeez"
would become
#[1] "never give up"
as the words "never", "give", and "up" are all in the custom dictionary. I have previously tried the following:
#Reading the custom dictionary as a function
english.words <- function(x) x %in% custom.dictionary
#Filtering based on words in the dictionary
DF2 <- DF1[(english.words(DF1$Text)),]
but my result is a character vector containing a single word. Any advice?
Your filter keeps only one element because `english.words(DF1$Text)` tests each whole sentence against the dictionary, not the individual words within it. Instead, split each sentence into words, keep only the words that are part of your dictionary, and paste them back into one sentence:
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
paste0(Filter(english.words, x), collapse = ' '))
Here I have created a new column called `Text1` containing only the English words; if you want to replace the original column, save the output to `DF1$Text` instead.
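A minimal, self-contained sketch of the split/filter/paste approach, using a toy three-word dictionary standing in for your 400,000-word one:

```r
# Toy dictionary standing in for the real 400,000-word one
custom.dictionary <- c("never", "give", "up")

# Predicate: is a word in the dictionary?
english.words <- function(x) x %in% custom.dictionary

DF1 <- data.frame(Text = "never give up uouo cbbuk jeez",
                  stringsAsFactors = FALSE)

# Split on whitespace, keep dictionary words, rejoin with single spaces
DF1$Text1 <- sapply(strsplit(DF1$Text, '\\s+'), function(x)
  paste0(Filter(english.words, x), collapse = ' '))

DF1$Text1
# [1] "never give up"
```

Since `%in%` hashes its right-hand side, a single lookup against 400,000 words is still fast; the `sapply` loop over rows is the main cost for very large data frames.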