
Replace a string in a tibble with part of that string


I have searched a lot of regex answers here, but can't find the solution to this kind of problem.

My dataset is a tibble of text containing Wikipedia links:

library(tidytext)
library(dplyr)    # needed for mutate() further down
library(stringr)
text.raw <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."

I'm trying to strip the link markup from the text. This:

str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])")
# [1] "Duits"     "architect"

selects the words I need from between the brackets.

This:

str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# [1] "Berthold Speer was een Duits Duits."

runs without error, but it is not quite what I need: every link is replaced with the first extracted word. This:

str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# Error: `replacement` must be a character vector

gives an error, where I expected "Berthold Speer was een Duits architect".

Currently my code looks something like this:

text.clean <- data_frame(text = text.raw) %>%
  mutate(text = str_replace_all(text, "\\[\\[.*?\\]\\]", str_extract_all(text, "[a-zA-Z\\s]+(?=\\])")))

I hope someone knows a solution, or can point me to a duplicate question if one exists. My desired output is "Berthold Speer was een Duits architect".


Solution

  • You may use a single gsub operation that captures the link's display text and substitutes it for the whole link:

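    text <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
    # Replace each [[target|label]] or [[label]] with the captured label
    gsub("\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}", "\\1", text)
    # [1] "Berthold Speer was een Duits architect."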
    


    The pattern (written with R's doubled backslashes) matches:

    • \\[{2} - two [ symbols
    • (?:[^]|]*\\|)? - an optional sequence matching
      • [^]|]* - zero or more chars other than ] and |
      • \\| - a pipe symbol
    • ([^]]*) - Group 1: zero or more chars other than ]
    • ]{2} - two ] symbols.
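
    The attempts in the question fail because str_replace_all is vectorised over its arguments: the replacement is paired with each input string, not with each match inside a string. A single extracted word is therefore used for every link, and the list returned by str_extract_all is rejected outright. A backreference avoids both problems. The following is a minimal sketch of the same idea inside the tibble pipeline from the question, assuming the dplyr and stringr packages are installed. The name link.pattern is introduced here for illustration only, and the literal ] characters are escaped as a precaution for stringr's ICU regex engine; that escaping is a choice made for this sketch, not part of the pattern above.

    ```r
    library(dplyr)
    library(stringr)

    text.raw <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."

    # Hypothetical helper name: the gsub pattern with every literal ] escaped.
    link.pattern <- "\\[{2}(?:[^\\]|]*\\|)?([^\\]]*)\\]{2}"

    # "\\1" inserts capture group 1 (the link's display text) for each match.
    text.clean <- tibble(text = text.raw) %>%
      mutate(text = str_replace_all(text, link.pattern, "\\1"))

    text.clean$text
    ```

    Because the replacement is a plain string containing a backreference, each match is replaced with its own captured text, so no separate extraction step is needed.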