r regex stringi

Locate a sub-string matching the general pattern '[a-z]-\n' and remove only the fixed part '-\n', keeping the letter


I have text I am cleaning up in R. I want to use stringi, but am happy to use other packages.

Some of the words are broken over two lines. So I get a sub-string "halfword-\nsecondhalfword".

I also have strings that are just "----\nword" and " -\n" (and some others) that I do not want to replace.

What I want to do is identify all sub-strings matching "[a-z]-\n", keep the letter matched by [a-z], and remove only the -\n characters.

I do not want to remove every -\n, and I do not want to remove the letter matched by [a-z].

Thanks!


Solution

  • You may make use of word boundaries to match -<LF> only in between word characters:

    gsub("\\b-\n\\b", "", x)
    gsub("(*UCP)\\b-\n\\b", "", x, perl=TRUE)
    stringr::str_replace_all(x, "\\b-\n\\b", "")
    

    The latter two support word boundaries between any Unicode word characters.

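As a quick check of the word-boundary approach, here is a minimal sketch that runs the PCRE variant on strings modelled on those in the question. The vector `x` and the name `cleaned` are made up for illustration.

```r
# Hypothetical sample input, modelled on the strings described in the question
x <- c("halfword-\nsecondhalfword", "----\nword", " -\n")

# Remove "-\n" only where a word character sits directly on both sides
cleaned <- gsub("(*UCP)\\b-\n\\b", "", x, perl = TRUE)

print(cleaned)
```

Only the first element should change: in "----\nword" and " -\n" the character before the final hyphen is not a word character, so there is no word boundary there and the pattern does not match.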

    If you want to only remove -<LF> between letters you may use

    gsub("([a-zA-Z])-\n([a-zA-Z])", "\\1\\2", x)
    gsub("(\\p{L})-\n(\\p{L})", "\\1\\2", x, perl=TRUE)
    stringr::str_replace_all(x, "(\\p{L})-\n(\\p{L})", "\\1\\2")
    

    If you need to only support lowercase letters, remove A-Z in the first gsub and replace \p{L} with \p{Ll} in the latter two.

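The letter-based patterns above require a letter on both sides of -\n. The question, read literally, only asks for a letter before it, so the sketch below compares the two readings on made-up sample strings. The one-sided pattern and the stringi call are my own variation, not part of the answer above, and the stringi part is skipped if the package is not installed. Note that stringi uses ICU regex, where capture groups in the replacement are written `$1` rather than `\\1`.

```r
# Hypothetical sample input; "end-\n" has a letter before "-\n" but nothing after
x <- c("halfword-\nsecondhalfword", "----\nword", " -\n", "end-\n")

# Approach from the answer: a lowercase letter is required on both sides
both_sides <- gsub("([a-z])-\n([a-z])", "\\1\\2", x)

# Variation (assumption about the intent): only require a letter before "-\n"
letter_before <- gsub("([a-z])-\n", "\\1", x)

# Same variation with stringi, if it is available
if (requireNamespace("stringi", quietly = TRUE)) {
  with_stringi <- stringi::stri_replace_all_regex(x, "([a-z])-\n", "$1")
  print(identical(with_stringi, letter_before))
}

print(both_sides)
print(letter_before)
```

The two versions differ only on "end-\n": the two-sided pattern leaves it alone, while the one-sided pattern strips the trailing -\n. Which one is right depends on whether a hyphen at the very end of the text should count as a broken word.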