I am working on a text analysis in R and have a dataset (text corpus) with various sentences about different fruits, for example "apple", "banana", "orange", "pear", etc.
Since it is not relevant for the analysis whether someone writes about "apples" or "bananas", I want to replace all different fruits with one specific word, for example "allfruits".
I thought about using regex, but I am facing two issues:
1) I want to avoid separate code lines for each kind of fruit. Thus, is there a way to define a list or a vector that I can use so that the function replaces all words in that list (apple, bananas, pear, etc.) with one specific word "allfruits"?
2) I want to make sure that words that are NOT a fruit but contain the same string as a fruit (e.g. the word "appletini") do not get replaced by the function.
Example: If I have a sentence that says "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!", I want the following to happen: "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"
I am not sure whether it is possible to write this with gsub, so any help is much appreciated.
Thank you!
allfruits can be extended to contain any words to be replaced:
allfruits = c("apple", "banana" , "orange", "pear")
replacement = "allfruits"
text = "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"
gsub(paste0("\\b(", paste0(allfruits, collapse="|"), ")[s]?\\b"), replacement, text, ignore.case = TRUE)
Returns:
[1] "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"
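Since gsub is vectorised over its text argument, the same call should also work on a whole character vector of sentences rather than a single string. A minimal sketch, assuming the corpus is held in a character vector; the `sentences` vector below is made up for illustration:

```r
allfruits <- c("apple", "banana", "orange", "pear")
replacement <- "allfruits"

# Hypothetical corpus: one element per sentence (not from the original data)
sentences <- c(
  "Apple is my favourite fruit.",
  "I ordered an appletini and two pears.",
  "No fruit mentioned here."
)

# Same pattern as above: word boundary, any fruit, optional "s", word boundary
pattern <- paste0("\\b(", paste0(allfruits, collapse = "|"), ")[s]?\\b")

# gsub processes every element of the vector and returns a vector of equal length
result <- gsub(pattern, replacement, sentences, ignore.case = TRUE)
print(result)
```

Elements without a match, such as the third sentence, come back unchanged.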
The regex:
\\b - word boundary
(", paste0(allfruits, collapse="|"), ") - all fruit names separated by a | (or)
[s]? - optional letter 's'
\\b - word boundary
ignore.case = TRUE - ignore case
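To see how the pieces described above fit together, it can help to print the assembled pattern and to wrap the call in a small function. This is only a sketch: `replace_words` and `pattern_demo` are hypothetical names introduced here, not part of base R or of the answer.

```r
# Hypothetical helper (not a base R function): replaces every word in `words`,
# with an optional trailing "s", by `replacement`, ignoring case.
replace_words <- function(text, words, replacement) {
  alternation <- paste0(words, collapse = "|")          # e.g. "apple|pear"
  pattern <- paste0("\\b(", alternation, ")[s]?\\b")    # add boundaries and optional "s"
  gsub(pattern, replacement, text, ignore.case = TRUE)
}

# Build the pattern for two words and print it as the regex engine receives it
pattern_demo <- paste0("\\b(", paste0(c("apple", "pear"), collapse = "|"), ")[s]?\\b")
cat(pattern_demo, "\n")   # \b(apple|pear)[s]?\b

# "Pears" is replaced; "pearls" is not, because no word boundary follows "pear"
out <- replace_words("Pears and pearls", c("apple", "pear"), "allfruits")
print(out)
```

Note that the optional "s" only covers regular plurals; a word with an irregular plural would need its plural form added to the vector explicitly.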