r, text, topic-modeling, quanteda

Is there an algorithm for removing a dash ("-") between two words and then contracting them?


I have a lot of text in which words are split by a dash at line breaks, like so:

vec <- "Today is a good day because the sun is shin- ing."

What I want is instead:

"Today is a good day because the sun is shining."

I don't want this only for specific words, but for every word that has been "broken up" like that. It seems like something you should be able to do in Word, but I haven't been able to figure out how, so maybe it's more complicated.

For the record, I am using the readtext and quanteda packages, but I can't find anything there that does this, at least not by default.

Is there some simple way to do this?


Solution

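  • Here is one way: use str_replace_all from the stringr package to delete every dash that is followed by whitespace. Note that this also removes a dash that legitimately ends a token, as in "5- 10".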

    vec <- "Today is a good day because the sun is shin- ing."
    
    library(stringr)
    
    # Delete each dash together with the whitespace (spaces or newlines) after it
    vec2 <- str_replace_all(vec, "-\\s+", "")
    
    vec2
    # [1] "Today is a good day because the sun is shining."
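
If deleting every dash followed by whitespace is too aggressive for your texts, a slightly stricter pattern may help. The sketch below uses base R only and assumes that a "broken" word always has a letter right before the dash and a lowercase letter right after the whitespace. `dehyphenate` is a hypothetical helper name, not a function from stringr, readtext, or quanteda.

```r
# Hypothetical helper: join words split by a dash at a line break,
# but only when the dash sits between a letter and a lowercase letter.
dehyphenate <- function(x) {
  # ([[:alpha:]])  a letter directly before the dash
  # -\\s+          the dash plus one or more spaces or newlines
  # ([[:lower:]])  a lowercase letter that starts the continuation
  gsub("([[:alpha:]])-\\s+([[:lower:]])", "\\1\\2", x, perl = TRUE)
}

vec <- c("Today is a good day because the sun is shin- ing.",
         "The sun is shin-\ning.",
         "Temperatures ranged from 5- 10 degrees.")

dehyphenate(vec)
# [1] "Today is a good day because the sun is shining."
# [2] "The sun is shining."
# [3] "Temperatures ranged from 5- 10 degrees."
```

This is still only a heuristic: a suspended hyphen such as "pre- and post-test" would be joined into "preand", because a regular expression cannot tell a broken word from an intentional dash. Catching those cases would need something like a dictionary lookup on the joined word.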