
Case-sensitive hyphenation replacement with regex


I'm trying to clean up some text in R with German input.

library(tidyverse)
bye_bye_hyphenation <- function(x){
  # rejoins words split by hyphenation, e.g. in text extracted from PDFs,
  # and removes the line break that follows the dash
  # first group (\\1): letters before the dash (incl. European ones), then the dash and following whitespace,
  # second group (\\2): letters after the dash (incl. European ones)
  stringr::str_replace_all(x, "([a-zA-Z\x7f-\xff]{1,})\\-[\\s]{1,}([a-zA-Z\x7f-\xff]{1,})", "\\1\\2")
}

# this works correctly
"Ex-\n ample" %>% 
  bye_bye_hyphenation()
#> [1] "Example"

# this should stay unchanged: `Regierungs-` and `und` should not be
# joined into `Regierungsund`
"Regierungs- und Verwaltungsgesetz" %>%
  bye_bye_hyphenation()
#> [1] "Regierungsund Verwaltungsgesetz"

Created on 2019-06-19 by the reprex package (v0.3.0)

Does anyone know how to make this regex case-sensitive so that it does not trigger in the second case, that is, whenever the word "und" follows a dash and a space?


Solution

  • Perhaps you could use negative and positive lookaheads (see e.g. "Regex lookahead, lookbehind and atomic groups"). The regex below removes a dash, together with an optional line break and an optional space after it, unless the word "und" follows. If "und" does follow, only the line break is removed and the dash stays:

    library(stringr)
    
    string1 <- "Ex- ample"
    string2 <- "Ex-\n ample"
    string3 <- "Regierungs- und Verwaltungsgesetz"
    string4 <- "Regierungs-\n und Verwaltungsgesetz"
    
    pattern <- "(-\\n?\\s?(?!\\n?\\s?und))|(\\n(?=\\s?und))"
    
    str_remove(string1, pattern)
    #> [1] "Example"
    str_remove(string2, pattern)
    #> [1] "Example"
    str_remove(string3, pattern)
    #> [1] "Regierungs- und Verwaltungsgesetz"
    str_remove(string4, pattern)
    #> [1] "Regierungs- und Verwaltungsgesetz"
    

    Created on 2019-06-19 by the reprex package (v0.3.0)
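
Two things to keep in mind with the pattern above: str_remove() only removes the first match in each string, and the pattern also matches a dash with no whitespace after it, so "E-Mail" would become "EMail". As a rough sketch of how the same lookahead idea could be wrapped into a reusable function, the variant below requires at least one whitespace character after the dash and only drops a line break before "und" when it directly follows a dash. The function name join_hyphenated and the tightened pattern are assumptions for illustration, not part of the answer above, and "und" is the only conjunction it handles:

```r
library(stringr)

# Hypothetical helper; the name and the tightened pattern are
# illustrative assumptions, not taken from the original answer.
join_hyphenated <- function(x) {
  pattern <- paste0(
    # a dash plus at least one whitespace character (incl. line breaks),
    # unless the next word is "und"
    "-\\s+(?!\\s*und\\b)",
    "|",
    # a line break directly after a dash, when "und" follows
    "(?<=-)\\n(?=\\s?und\\b)"
  )
  # str_remove_all() removes every match, not just the first one
  str_remove_all(x, pattern)
}

join_hyphenated(c(
  "Ex-\n ample",
  "Bei- spiel und Ex- ample",
  "Regierungs-\n und Verwaltungsgesetz",
  "E-Mail"
))
```

With these inputs the call should return "Example", "Beispiel und Example", "Regierungs- und Verwaltungsgesetz" and "E-Mail". Other suspended hyphens in German (for example before "oder") would need to be added to the lookaheads in the same way.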