regex, r, oracle-database, text-mining

How to delete a list of words from a column in R


I have a column of titles in a table and would like to delete all words that are listed in a separate table/vector.

For example, table of titles:

"Lorem ipsum dolor"
"sit amet, consectetur adipiscing"
"elit, sed do eiusmod tempor"
"incididunt ut labore"
"et dolore magna aliqua."

To be deleted: c("Lorem", "dolore", "elit")

Desired output:

"ipsum dolor"
"sit amet, consectetur adipiscing"
", sed do eiusmod tempor"
"incididunt ut labore"
"et magna aliqua."

The blacklisted words can occur multiple times.

The tm package has this functionality, but only when applied to a word cloud. I need to leave the column intact rather than joining all the rows into one string of characters. Regex functions such as gsub() don't seem to work when given a vector of values as the pattern. An Oracle SQL solution would also be interesting.


Solution

  • First read the data:

    dat <- c("Lorem ipsum dolor",
             "sit amet, consectetur adipiscing",
             "elit, sed do eiusmod tempor",
             "incididunt ut labore",
             "et dolore magna aliqua.")
    todelete <- c("Lorem", "dolore", "elit")
    

    We can avoid loops with a little smart pasting. In a regular expression, | means "or", so collapsing the words into a single pattern separated by | lets one gsub() call remove all of them:

    gsub(paste0(todelete, collapse = "|"), "", dat)
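
    The one-liner above matches substrings as well as whole words (a blacklisted "elit" would also be cut out of "elite"), and it leaves behind the spaces that surrounded each removed word. A minimal sketch of one way to get closer to the desired output is below. It assumes the blacklisted words contain no regex metacharacters, and remove_words and titles are hypothetical names used only for illustration:

```r
# Sample data from the question
dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

# remove_words() is a hypothetical helper, not part of any package.
# Assumption: the words contain no regex metacharacters (., *, (, etc.).
remove_words <- function(x, words) {
  # Build one alternation wrapped in word boundaries: \b(Lorem|dolore|elit)\b
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the doubled spaces left behind and trim both ends
  trimws(gsub("\\s+", " ", x))
}

# Applied to a data frame column, every row stays a separate string
titles <- data.frame(title = dat, stringsAsFactors = FALSE)
titles$title <- remove_words(titles$title, todelete)
titles$title
```

    Because gsub() is vectorised over its input, the column keeps one string per row; only the pattern has to be a single string, which is why the words are collapsed with | first.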