r regex nlp data-cleaning topic-modeling

Replace a list of words with one unique word in R

I am working on a text analysis with R and have a dataset (text corpus) with various sentences about different fruits, for example "apple", "banana", "orange", "pear", etc.

Since it is not relevant for the analysis whether someone writes about "apples" or "bananas", I want to replace all different fruits with one specific word, for example "allfruits".

I thought about using regex, but I am facing two issues:

1) I want to avoid separate code lines for each kind of fruit. Thus, is there a way to define a list or a vector that I can use so that the function replaces all words in that list (apple, bananas, pear, etc.) with one specific word "allfruits"?

2) I want to avoid replacing words that are NOT a fruit but contain the same string as a fruit (e.g. the word "appletini").

Example: If I have a sentence that says "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!", I want the following result: "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"

I am not sure whether it is possible to write this with a gsub function. Thus, all help is much appreciated.

Thank you!

Solution

The vector allfruits can be extended to contain any words to be replaced:

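# Words to replace (singular forms) and the single replacement word
allfruits <- c("apple", "banana", "orange", "pear")
replacement <- "allfruits"
text <- "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"

# Join the words with | and wrap them in word boundaries before substituting
gsub(paste0("\\b(", paste0(allfruits, collapse = "|"), ")[s]?\\b"), replacement, text, ignore.case = TRUE)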

Returns

[1] "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"
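To see what gsub actually receives, it can help to build the pattern as a separate string and print it. This is only a small sketch; the variable name pattern is my own choice and not required by anything:

```r
# Same vector as in the answer
allfruits <- c("apple", "banana", "orange", "pear")

# Assemble the pattern once so it can be inspected and reused
pattern <- paste0("\\b(", paste0(allfruits, collapse = "|"), ")[s]?\\b")

# cat() prints the string as the regex engine sees it (single backslashes)
cat(pattern, "\n")
```

The doubled backslashes are only R string escaping; the regex engine itself sees a single \b on each side of the group.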

The regex:

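\\b - word boundary: the match must begin at the start of a word
(", paste0(allfruits, collapse="|"), ") - all fruit names joined by | (or), giving (apple|banana|orange|pear)
[s]? - optional letter 's', so plurals such as "bananas" also match
\\b - word boundary: the match must end at the end of a word, which is what leaves "appletini" unchanged
ignore.case = TRUE - ignore case, so "Apple" is replaced as well as "apple"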
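Since gsub is vectorised over its input, the same idea should carry over to a whole corpus stored as a character vector. Below is a minimal sketch under that assumption; replace_words is a hypothetical helper name (not a base R function), and the three sentences are made-up sample data:

```r
# Hypothetical helper, not part of base R: wraps the pattern construction
replace_words <- function(text, words, replacement) {
  # Build \b(word1|word2|...)s?\b from the word vector
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")s?\\b")
  # gsub is vectorised, so text may be a character vector of any length
  gsub(pattern, replacement, text, ignore.case = TRUE)
}

fruits <- c("apple", "banana", "orange", "pear")

# Made-up sample corpus for illustration
corpus <- c(
  "Apple is my favourite fruit, appletini is my favourite drink.",
  "I also like bananas!",
  "Pears and oranges are fine, but pearls are not fruit."
)

result <- replace_words(corpus, fruits, "allfruits")
print(result)
```

Note that the optional s only covers plurals formed by appending "s". Plurals such as "cherries" or "peaches" would have to be added to the vector explicitly, and words containing regex metacharacters would need escaping first.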