I have text I am cleaning up in R. I want to use stringi, but am happy to use other packages.
Some of the words are broken over two lines. So I get a sub-string "halfword-\nsecondhalfword".
I also have strings that are just "----\nword" and " -\n" (and some others) that I do not want to replace.
What I want to do is identify all sub-strings "[a-z]-\n" and then keep the generic letter [a-z], but remove the -\n characters.
I do not want to remove all -\n , and I do not want to remove the letter [a-z].
Thanks!
You may make use of word boundaries to match -<LF> only in between word characters:
gsub("\\b-\n\\b", "", x)
gsub("(*UCP)\\b-\n\\b", "", x, perl=TRUE)
stringr::str_replace_all(x, "\\b-\n\\b", "")
The latter two support word boundaries between any Unicode word characters.
See the regex demo.
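A quick check of the word-boundary pattern against the strings from the question may help. This is a minimal sketch: the vector `x` is hypothetical sample data, and the last element is an added assumption to show that digits (and underscores) also count as word characters, so `-<LF>` between them is removed as well.

```r
# Hypothetical sample input based on the cases named in the question.
x <- c("halfword-\nsecondhalfword", "----\nword", " -\n", "step1-\n2nd")

# Remove "-\n" only when a word character sits directly on both sides.
res <- gsub("\\b-\n\\b", "", x, perl = TRUE)

# Elements 1 and 4 are joined; elements 2 and 3 are left untouched.
print(res)
```

If joining across digits is not wanted, the letter-based patterns are the safer choice.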
If you want to remove -<LF> only between letters, you may use
gsub("([a-zA-Z])-\n([a-zA-Z])", "\\1\\2", x)
gsub("(\\p{L})-\n(\\p{L})", "\\1\\2", x, perl=TRUE)
stringr::str_replace_all(x, "(\\p{L})-\n(\\p{L})", "\\1\\2")
If you need to support only lowercase letters, remove A-Z in the first gsub and replace \p{L} with \p{Ll} in the latter two.
See this regex demo.
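The question literally asks for a lowercase letter before `-<LF>` only, without requiring a letter after it. The sketch below compares that reading with the two-sided pattern. The vector `x` is hypothetical sample data, the one-group pattern is an assumption about what the asker needs rather than part of the patterns above, and the stringi call is skipped if that package is not installed.

```r
# Hypothetical sample input; the last element ends in "-\n" with nothing after it.
x <- c("halfword-\nsecondhalfword", "----\nword", " -\n", "step1-\n2nd", "end-\n")

# Lowercase letters required on both sides of "-\n".
both_sides <- gsub("([a-z])-\n([a-z])", "\\1\\2", x)

# Lowercase letter required only before "-\n"; the letter is kept via \\1.
before_only <- gsub("([a-z])-\n", "\\1", x)

# Same idea with stringi, where the captured group is written as $1.
if (requireNamespace("stringi", quietly = TRUE)) {
  with_stringi <- stringi::stri_replace_all_regex(x, "([a-z])-\n", "$1")
}

print(both_sides)
print(before_only)
```

The only difference on this input is the last element: the two-sided pattern leaves "end-\n" alone, while the one-sided pattern reduces it to "end".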