Tags: r, regex, gsub, tm, stringr

R: removing part of the word in a character string


I have a character vector

words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")

I'm trying to remove span and punctuation from every word in the vector, so that the result is:

> something thank great to hear your

The catch is that there's no rule for whether span appears before or after the word I'm interested in. Also, span can be glued to: i) characters only (e.g. yourspan), ii) punctuation only (e.g. ..span?), or iii) characters and punctuation (e.g. somethingspan.).

I searched SO for an answer, but I mostly found requests to remove whole words, or to remove the part of a string before/after a given letter or punctuation mark.

Any help will be appreciated


Solution

  • You may use

    [[:punct:]]*span[[:punct:]]*
    

    See the regex demo.

    Details

    • [[:punct:]]* - zero or more punctuation characters
    • span - a literal substring
    • [[:punct:]]* - zero or more punctuation characters
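
To see what the pattern described above actually captures, here is a minimal sketch that extracts the first match from each element with base R's regexpr() and regmatches(). The variable names pattern, hits, and matched are illustrative only and not part of the original answer.

```r
words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")
pattern <- "[[:punct:]]*span[[:punct:]]*"

# regexpr() locates the first match in each element (-1 when there is none);
# regmatches() then returns the matched text for the elements that matched.
hits <- regexpr(pattern, words)
matched <- regmatches(words, hits)

print(matched)
## => [1] "span."   "..span?" "span"    "span"
```

Note that "great to hear" contributes nothing to the result because it contains no span, which is why gsub() leaves it untouched.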

    R Demo:

    words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")
    words <- gsub("[[:punct:]]*span[[:punct:]]*", "", words) # Remove spans
    words <- words[words != ""] # Discard empty elements
    paste(words, collapse=" ")  # Concat the elements
    ## => [1] "something thank great to hear your"
    

    If removing the unwanted strings leaves whitespace-only elements, replace the second step with words <- words[trimws(words) != ""] (instead of words[words != ""]).
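
The difference between the two filters can be sketched as follows. The input vector here is a hypothetical one (not from the question), chosen so that one element is reduced to spaces only; the names cleaned, kept_naive, and kept_trim are likewise illustrative.

```r
# Hypothetical input: the second element has spaces around the part to remove.
words <- c("somethingspan.", " ..span? ", "yourspan")

cleaned <- gsub("[[:punct:]]*span[[:punct:]]*", "", words)

# Comparing against "" keeps the element that now holds only spaces.
kept_naive <- cleaned[cleaned != ""]

# trimws() strips leading/trailing whitespace before the comparison,
# so the whitespace-only element is discarded as well.
kept_trim <- cleaned[trimws(cleaned) != ""]

result <- paste(kept_trim, collapse = " ")
print(result)
## => [1] "something your"
```

trimws() is available in base R from version 3.2.0 onward, so no extra package should be needed for this variant.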