r stringr tidytext

Removing specific text in R


I have a character vector in a data frame in R which contains inbound email text. Most of the rows contain 'Dear x,' where x is any intended recipient and x can vary. There could also be typos such as the incorrect use of lowercase. Either way, the common feature is that they start with the word 'dear' (upper or lowercase) and end in a comma.


df <- data.frame(emails = c("Dear dave, I have seen what you...", "Dear Mr Smith, I recieved your reply...", "dear stu, I note that you have not..."),
                 account = c(534, 434, 544)
)

df

                                   emails account
1      Dear dave, I have seen what you...     534
2 Dear Mr Smith, I recieved your reply...     434
3   dear stu, I note that you have not...     544

I want to trim off the greeting so that each email starts with the main body of text, like this:

                       emails account
1     I have seen what you...     534
2    I recieved your reply...     434
3 I note that you have not...     544

Solution

  • Using trimws in base R

    # 'whitespace' takes a regex (R >= 3.6.0); which = "left" trims only at the start
    df$emails <- trimws(df$emails, which = "left", whitespace = "[Dd]ear[^,]+,\\s+")
    

    -output

    df$emails
    [1] "I have seen what you..."     "I recieved your reply..."    "I note that you have not..."
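
  • Alternatively, with sub in base R

    This is a minimal sketch rather than part of the original answer. It assumes the greeting always sits at the very start of the string and ends at the first comma. Anchoring with ^ and passing ignore.case = TRUE should also cover greetings such as 'DEAR', and rows without a greeting are left unchanged.

```r
# Sample data from the question
df <- data.frame(
  emails = c("Dear dave, I have seen what you...",
             "Dear Mr Smith, I recieved your reply...",
             "dear stu, I note that you have not..."),
  account = c(534, 434, 544)
)

# ^       anchor at the start of the string
# dear    the greeting word (case is handled by ignore.case = TRUE)
# [^,]*   everything up to the first comma
# ,\\s*   the comma itself and any whitespace after it
df$emails <- sub("^dear[^,]*,\\s*", "", df$emails, ignore.case = TRUE)

df$emails
# [1] "I have seen what you..."     "I recieved your reply..."    "I note that you have not..."
```

    Because sub replaces only the first match and the pattern is anchored, a later "dear ..., " inside the body of an email is not touched.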