Tags: r, regex, nlp, data-cleaning, topic-modeling

Replace a list of words with one unique word in R


I am working on a text analysis in R and have a dataset (a text corpus) of sentences about different fruits, for example "apple", "banana", "orange", "pear", etc.

Since it is not relevant for the analysis whether someone writes about "apples" or "bananas", I want to replace all different fruits with one specific word, for example "allfruits".

I thought about using regex, but I am facing two issues:

1) I want to avoid a separate line of code for each kind of fruit. Is there a way to define a list or a vector so that the function replaces every word in it (apple, bananas, pear, etc.) with the single word "allfruits"?

2) Words that are NOT a fruit but contain the same string as a fruit (e.g. the word "appletini") should not be replaced by the function.

Example: If I have a sentence that says "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!", I want the following result: "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"

I am not sure whether this is possible with the gsub function. All help is much appreciated.

Thank you!


Solution

  • allfruits can be extended to contain any words to be replaced:

    # Words to replace; extend this vector as needed
    allfruits = c("apple", "banana", "orange", "pear")
    replacement = "allfruits"
    text = "Apple is my favourite fruit, appletini is my favourite drink. I also like bananas!"
    
    # Build \b(apple|banana|orange|pear)[s]?\b and replace every match, ignoring case
    gsub(paste0("\\b(", paste0(allfruits, collapse="|"), ")[s]?\\b"), replacement, text, ignore.case = TRUE)
    

    Returns

    [1] "allfruits is my favourite fruit, appletini is my favourite drink. I also like allfruits!"
    

    The regex:

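    • \\b - word boundary, so a match must start at the beginning of a word (e.g. not inside "pineapple")
    • (", paste0(allfruits, collapse="|"), ") - all fruit names separated by | (or), i.e. (apple|banana|orange|pear)
    • [s]? - optional letter 's' (equivalent to s?), so plurals such as "bananas" also match
    • \\b - word boundary, so a match must end at the end of a word (this is why "appletini" is left alone)
    • ignore.case = TRUE - a gsub argument rather than part of the regex; it makes the match case-insensitive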
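
If the same replacement has to run over a whole corpus, the pattern above could be wrapped in a small helper. This is only a sketch: the name replace_words, its arguments, and the extra example sentences are made up for illustration and are not part of the answer above. It also assumes every word in the list is plain letters with no regex metacharacters.

```r
# Hypothetical helper (name and arguments are illustrative only).
# Assumes each entry in `words` consists of plain letters, so nothing
# needs to be escaped before it is pasted into the pattern.
replace_words <- function(text, words, replacement) {
  # Builds \b(apple|banana|...)s?\b : whole words only, optional plural "s"
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")s?\\b")
  gsub(pattern, replacement, text, ignore.case = TRUE)
}

fruits <- c("apple", "banana", "orange", "pear")

# gsub is vectorised, so a character vector of sentences works directly
corpus <- c(
  "Apple is my favourite fruit, appletini is my favourite drink.",
  "I also like bananas!",
  "Pears and oranges are fine, pearls are not fruit."
)

result <- replace_words(corpus, fruits, "allfruits")
print(result)
```

Because of the trailing word boundary, "appletini" and "pearls" should be left untouched, while "Apple", "bananas", "Pears" and "oranges" all become "allfruits". Irregular plurals (for example "berries" for "berry") are not covered by the optional "s" and would have to be added to the vector explicitly.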