I'm starting my first text analysis project in R using Twitter data, and in the pre-processing stage I'm trying to remove everything that appears within quotation marks. I've found code that removes the quotation marks themselves but not the text inside them (e.g., "Hello World" becomes Hello World), but nothing that consistently removes both the text AND the quotation marks (e.g., This is a "quoted text" becomes This is a).
I've anonymised an example data frame that I'm working with (with exact formatting for these particular tweets retained, just the content changed):
df <- data.frame(text = c("Example: “This is a quote!” https://t.co/ - MORE TEXT - example: “more text... “quote inside a quote” finished.”",
"Text \"this is a quote.\" More text. https://t.co/"))
For this dataframe, the aim is to end up with:
Example: https://t.co/ - MORE TEXT - example:
Text More text. https://t.co/
I've tried these:
df$text <- gsub('"[^"]+"', '', df$text)
df$text <- gsub('".*"', '', df$text)
df$text <- gsub("[\"'].*['\"]","", df$text)
But these only succeed in removing the quotation from the second observation, not the first. I suspect it might have something to do with how the second quote was imported from Twitter, enclosed with \ . I'm not sure whether this hypothesis is correct, though, and if it is, I'm not sure how to overcome it. Any help would be greatly appreciated!
Here's a solution using a one-liner pattern:
library(tidyverse)
df %>%
  mutate(text = str_remove_all(text, '"[^"]+"|“[^“”]+”|“.+”'))
text
1 Example: https://t.co/ - MORE TEXT - example:
2 Text More text. https://t.co/
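On why the three gsub() attempts in the question only affected the second row: the backslashes are most likely not the cause. In R source, \" is just an escape for a plain double quote, whereas the first tweet uses curly quotes (“ and ”), which are different characters altogether. A minimal base R check, where x is a shortened, hypothetical copy of the example tweets:

```r
# Straight and curly double quotes are distinct code points
utf8ToInt("\"")        # 34   (U+0022, straight quote)
utf8ToInt("\u201c")    # 8220 (U+201C, left curly quote)
utf8ToInt("\u201d")    # 8221 (U+201D, right curly quote)

# Shortened stand-ins for the two tweets (assumption: same quote characters)
x <- c("Example: \u201cThis is a quote!\u201d https://t.co/",
       "Text \"this is a quote.\" More text. https://t.co/")

# \" in R source is only an escape; the stored string holds no backslash
has_backslash <- grepl("\\", x, fixed = TRUE)

# A pattern built on the straight quote cannot see the curly ones
has_straight <- grepl('"', x, fixed = TRUE)

has_backslash   # FALSE FALSE
has_straight    # FALSE  TRUE
```

This is why the pattern above needs separate alternatives for “ and ”.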
The pattern takes care of the variability shown in text using three alternative patterns:
"[^"]+" : first alternative: removes simple quotes wrapped in "
“[^“”]+” : second alternative: removes simple quotes wrapped in “ and ”
“.+” : third alternative: removes what was the parent quote of a nested quote wrapped in “ and ”
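To see which alternative fires where, the matches can be extracted instead of removed. This is a small sketch assuming stringr is installed; x, pattern, matches and cleaned are illustrative names, and the curly quotes are written as \u201c and \u201d escapes only to keep the snippet encoding-safe:

```r
library(stringr)

# First example tweet; \u201c and \u201d are the curly quotes
x <- "Example: \u201cThis is a quote!\u201d https://t.co/ - MORE TEXT - example: \u201cmore text... \u201cquote inside a quote\u201d finished.\u201d"

# Same three alternatives as in the pattern above
pattern <- '"[^"]+"|\u201c[^\u201c\u201d]+\u201d|\u201c.+\u201d'

# Extract rather than remove, to inspect what each alternative catches
matches <- str_extract_all(x, pattern)[[1]]
matches

# Removal leaves doubled or trailing spaces behind; str_squish() tidies them
cleaned <- str_squish(str_remove_all(x, pattern))
cleaned
```

The first match should come from the second alternative and the second from the greedy third one. Because .+ is greedy, a tweet with a nested quote followed by a separate, later quote would also lose the text in between, so it is worth spot-checking real data.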
If in the actual data there are also nested " " quotes, this could be accounted for with yet another alternation.
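One caveat before adding that alternation: straight quotes use the same character to open and close, so the nesting cannot be read off the text the way it can with “ and ”. A hedged sketch with stringr, using made-up strings (nested, separate) rather than real tweets:

```r
library(stringr)

# Hypothetical inputs, not taken from the original data
nested   <- 'Text "outer "inner" end" more'
separate <- 'a "x" b "y" c'

# Appending ".+" as a last alternative changes nothing here: the first
# alternative already succeeds whenever another " follows later on
r1 <- str_remove_all(nested, '"[^"]+"|".+"')

# A greedy ".+" on its own removes the nested quote as a whole...
r2 <- str_remove_all(nested, '".+"')

# ...but it also swallows the text between two separate quotes
r3 <- str_remove_all(separate, '".+"')

r1   # "Text inner more"
r2   # "Text  more"
r3   # "a  c"
```

So for straight quotes a choice has to be made between pairing quotes up in order and removing everything between the first and the last one; which is right depends on the data.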