
Remove space after lemmatization


I lemmatized a character vector. The problem is that lemmatization inserts spaces around the dash in hyphenated words (e.g. short-term becomes short - term). My character vector is full of such words, so I would like to find a way to remove this distortion.

Let me take an example:

text <- c("Stackoverflow is a great website where you can find great and very skilled people who are so kind to solve your coding problems. In the short-term is a very good thing because you can speed up your research, in the long-term is better if you learn how to code on your own. Let me add more non-sense to make my point. The growth-friendly composition of public finance is a good thing.")

library(textstem)  # provides lemmatize_strings()
ch_vector <- lemmatize_strings(text)

As I said, the outcome is this:

"Stackoverflow be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short - term** be a very good thing because you can speed up your research, in the **long - term** be good if you learn how to code on your own. Let me add much **non - sense** to make my point. The **growth - friendly** composition of public finance be a good thing."

Instead I want this:

"Stackoverflow be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short-term** be a very good thing because you can speed up your research, in the **long-term** be good if you learn how to code on your own. Let me add much **non-sense** to make my point. The **growth-friendly** composition of public finance be a good thing."

So far, I have fixed each word of interest individually, like this:

ch <- sub(pattern = "growth - friendly", replacement = "growth-friendly", x = ch_vector, fixed = TRUE)

But this is time-consuming and inefficient, and it does not always work (for example, when the capitalization differs).

Can you suggest a better way to do it?

Thanks a lot


Solution

  • x <- "Stackoverflow be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short - term** be a very good thing because you can speed up your research, in the **long - term** be good if you learn how to code on your own. Let me add much **non - sense** to make my point. The **growth - friendly** composition of public finance be a good thing."
    

    Using gsub() to replace every dash surrounded by spaces with a bare dash should accomplish what you're after with minimal effort.

    gsub(" - ", "-", x, fixed = TRUE)
    
    # [1] "Stackoverflow be a great website where you can find great and very skill people
    # who be so kind to solve your code problem. In the **short-term** be a very good thing
    # because you can speed up your research, in the **long-term** be good if you learn how to
    # code on your own. Let me add much **non-sense** to make my point. The 
    # **growth-friendly** composition of public finance be a good thing."
    

    However, I'm not sure how this interacts with the intended usage of the textstem package, so it may or may not meet your needs.
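
One caveat with the plain " - " replacement is that it also joins things that were never hyphenated words, such as a spaced subtraction like "5 - 3". A slightly stricter variant, sketched below, only removes the spaces when there is a letter on both sides of the dash. The helper name rejoin_hyphens and the sample sentences are made up for illustration (they are not part of textstem), and this has not been checked against every form of output lemmatize_strings() can produce. Because the pattern does not mention specific words, it is unaffected by capitalization, and gsub() is vectorized, so it works on a whole character vector at once.

```r
# Hypothetical helper (not part of textstem): rejoin words that were
# split around a dash, but only when a letter sits on both sides.
rejoin_hyphens <- function(x) {
  # (?<=[[:alpha:]])  look-behind: a letter precedes " - "
  # (?=[[:alpha:]])   look-ahead:  a letter follows " - "
  # perl = TRUE is required for look-around assertions.
  gsub("(?<=[[:alpha:]]) - (?=[[:alpha:]])", "-", x, perl = TRUE)
}

# Made-up sample input resembling lemmatized text
lemmatized <- c(
  "In the short - term be a very good thing.",
  "The Growth - Friendly composition of public finance.",
  "The result of 5 - 3 be 2."
)

cleaned <- rejoin_hyphens(lemmatized)
print(cleaned)
```

With this version the first two strings are rejoined to "short-term" and "Growth-Friendly", while "5 - 3" is left alone. It still cannot tell a hyphenated word from a spaced dash used as punctuation between two words, so text containing those needs a different approach.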