Struggling with removing words based on pattern (text analysis in R)


I'm new to text analysis and have been struggling with a particular problem in R this past week. I am trying to figure out how to remove or replace all variations of a word in a character vector. For example, if the vector is:

test <- c("development", "develop", "developing", "developer", "apples", "kiwi")

I want the end output to be:

"apples", "kiwi"

So, basically, I'm trying to figure out how to remove or replace all words beginning with "develop" (that is, matching the pattern "^develop"). I have tried str_remove_all from the stringr package with this expression:

library(stringr)
str_remove_all(test, "^dev")

But the end result was this:

"elopment", "elop", "eloping", "eloper", "apples", "kiwi"

It only removed the part of each word that matched the pattern "^dev", whereas I want to remove the entire word if it begins with "dev".

Thanks!


Solution

  • Use grep with invert:

    grep("^develop", test, invert = TRUE, value = TRUE)
    ## [1] "apples" "kiwi"  
    

    or negate grepl:

    ok <- !grepl("^develop", test)
    test[ok]
    

    or strip a leading "develop" with sub and keep only the elements that come back unchanged:

    test[sub("^develop", "", test) == test]
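
Since the question uses stringr, the same filtering can probably be written with stringr as well. This is a minimal sketch, assuming stringr 1.4.0 or later (where str_subset accepts a negate argument); the names kept, kept2 and replaced, and the "REMOVED" placeholder, are illustrative only and not part of the original question or answer.

```r
library(stringr)

test <- c("development", "develop", "developing", "developer", "apples", "kiwi")

# Keep only the elements that do NOT start with "develop"
kept <- str_subset(test, "^develop", negate = TRUE)

# Same result via logical indexing with str_detect
kept2 <- test[!str_detect(test, "^develop")]

# To replace instead of drop, extend the pattern with ".*" so it
# covers the whole element rather than just the leading "develop"
# ("REMOVED" is a hypothetical placeholder value)
replaced <- str_replace(test, "^develop.*", "REMOVED")
```

The point to notice is that str_remove_all only deletes the text the pattern actually matches, so dropping whole elements needs a subsetting function, while replacing whole elements needs a pattern that spans the entire word.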