I have searched a lot of regex answers here, but can't find the solution to this kind of problem.
My dataset is a tibble with Wikipedia links:
library(dplyr)     # provides %>% and mutate() used below
library(tidytext)
library(stringr)
text.raw <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
I'm trying to clean up my text from the links. This:
str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])")
# [1] "Duits" "architect"
selects the words I need from between the brackets.
This:
str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# [1] "Berthold Speer was een Duits Duits."
works as expected, but is not quite what I need. This:
str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# Error: `replacement` must be a character vector
gives an error where I expected "Berthold Speer was een Duits architect".
Currently my code looks something like this:
text.clean <- data_frame(text = text.raw) %>%
mutate(text = str_replace_all(text, "\\[\\[.*?\\]\\]", str_extract_all(text, "[a-zA-Z\\s]+(?=\\])")))
I hope someone knows a solution, or can point me to a duplicate question if one exists. My desired output is "Berthold Speer was een Duits architect".
You may use a single gsub operation:
text <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
gsub("\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}", "\\1", text)
See the online R demo.
The pattern will match:
- \\[{2} - two [ symbols
- (?:[^]|]*\\|)? - an optional sequence matching
  - [^]|]* - zero or more chars other than ] and |
  - \\| - a pipe symbol
- ([^]]*) - Group 1: zero or more chars other than ]
- ]{2} - two ] symbols.
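To tie this back to the data-frame setting in the question, here is a minimal sketch that applies the same pattern to a whole column. It uses base R only; the data frame `df`, its second row, and the name `link_pattern` are made up for illustration and are not part of the original question.

```r
# Pattern from the answer above: drop an optional "target|" prefix and
# keep only the visible label captured in Group 1.
link_pattern <- "\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}"

# Hypothetical two-row data frame; the second row contains no links.
df <- data.frame(
  text = c(
    "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]].",
    "Geen links hier."
  ),
  stringsAsFactors = FALSE
)

# gsub() is vectorised over its input, so every row is cleaned in one call.
# "\\1" inserts whatever Group 1 matched for each individual link.
df$text <- gsub(link_pattern, "\\1", df$text)

print(df$text)
```

Because gsub() is vectorised, the same call should also drop into a dplyr pipeline, for example mutate(text = gsub(link_pattern, "\\1", text)). The backreference is what fixes the original attempt: the replacement is computed per match instead of being one fixed string reused for every link.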