I have a character vector in a data frame in R that contains inbound email text. Most rows begin with 'Dear x,', where x is the intended recipient and varies from row to row. There may also be typos, such as a lowercase letter where a capital belongs. Either way, the common feature is that the greeting starts with the word 'dear' (upper or lowercase) and ends in a comma.
df <- data.frame(
  emails = c("Dear dave, I have seen what you...",
             "Dear Mr Smith, I recieved your reply...",
             "dear stu, I note that you have not..."),
  account = c(534, 434, 544)
)
df
                                   emails account
1      Dear dave, I have seen what you...     534
2 Dear Mr Smith, I recieved your reply...     434
3   dear stu, I note that you have not...     544
I want to trim off the greeting so that each email starts with the main body of text, like this:
                       emails account
1     I have seen what you...     534
2    I recieved your reply...     434
3 I note that you have not...     544
Using trimws in base R (the whitespace argument requires R 3.6.0 or later):
df$emails <- trimws(df$emails, whitespace = "[Dd]ear[^,]+,\\s+")
-output
df$emails
[1] "I have seen what you..." "I recieved your reply..." "I note that you have not..."
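If you would rather not rely on trimws accepting an arbitrary pattern, a plain sub() call is another option. The following is only a sketch under a few assumptions: the greeting always sits at the very start of the string, the recipient never contains a comma, and the last two rows of the sample data are hypothetical ones added here to show an all-caps greeting and a 'dear' that appears mid-sentence.

```r
# Sample data from the question, plus two hypothetical rows (assumptions
# added here) to exercise edge cases.
df <- data.frame(
  emails = c("Dear dave, I have seen what you...",
             "Dear Mr Smith, I recieved your reply...",
             "dear stu, I note that you have not...",
             "DEAR Ann,no space after the comma",      # hypothetical row
             "Thanks, dear friend, for your help"),    # hypothetical row
  account = c(534, 434, 544, 101, 102),
  stringsAsFactors = FALSE
)

# ^\\s*   optional whitespace at the very start of the string
# dear    the greeting word, matched in any case via ignore.case = TRUE
# [^,]*   everything up to the first comma (the recipient)
# ,\\s*   the comma itself and any whitespace that follows it
df$emails <- sub("^\\s*dear[^,]*,\\s*", "", df$emails, ignore.case = TRUE)

df$emails
```

Because the pattern is anchored with ^, a 'dear' that appears later in the text is left alone, and ignore.case = TRUE also covers greetings such as 'DEAR', which [Dd]ear would miss.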