Tags: r, regex, string, gsub

Remove all words in string containing punctuation (R)


How (in R) would I remove every word in a string that contains punctuation, keeping the words that do not?

  test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"

  desired <- "I a see works not"

Solution

  • Here is an approach using gsub, which seems to work:

    test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
    gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
    
    [1] "I a see works not"
    

    The approach uses the following regex pattern:

    [A-Za-z]*     match zero or more leading letters
    [^A-Za-z ]    then match exactly one symbol (any character that is neither a letter nor a space, so digits count too)
    \\S*          followed by zero or more non-whitespace characters
    \\s*          followed by zero or more whitespace characters
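
To see what the pattern picks up before anything is deleted, the matches can be listed with regmatches() and gregexpr(). This is only an inspection sketch; the variable names `pattern` and `matched` are introduced here for illustration and are not part of the original answer.

```r
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
pattern <- "[A-Za-z]*[^A-Za-z ]\\S*\\s*"

# List every substring the pattern matches; these are the pieces gsub() deletes.
matched <- regmatches(test.string, gregexpr(pattern, test.string))[[1]]
print(matched)

# Seven pieces are matched: "am:% ", "test$ ", "to& ", "if* ", "your# ",
# "fun/ction " and "o\r " (each one includes its trailing space).
length(matched)
```

Each match carries its trailing whitespace along, which is why the remaining words end up separated by single spaces.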
    

    Each match is then replaced with the empty string, which removes every word containing one or more symbols.
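
One caveat with the single-pattern approach: if the last word of the string is removed, the space before it is left behind. A possible alternative, sketched below, splits on spaces and keeps only the tokens made up entirely of letters. The helper name `keep_plain_words` is hypothetical and not from the original answer, and it assumes the same definition of a "clean" word as the pattern above (ASCII letters only, so words with digits are dropped too).

```r
# Hypothetical helper: keep only the words consisting solely of ASCII letters.
keep_plain_words <- function(x) {
  # Split on runs of literal spaces, matching the answer's notion of a separator.
  words <- strsplit(x, " +")[[1]]
  # Keep tokens that are letters from start to end; empty tokens are dropped too.
  kept <- words[grepl("^[A-Za-z]+$", words)]
  paste(kept, collapse = " ")
}

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
keep_plain_words(test.string)

# Edge case: the removed word is the last one in the string.
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", "keep th!s")  # leaves a trailing space
keep_plain_words("keep th!s")                         # no trailing space
```

On the test string both approaches give the same result; they differ only in how whitespace around a removed word is handled.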