I lemmatized a character vector with lemmatize_strings(). The problem is that lemmatization inserts spaces around the hyphen in hyphenated words (e.g., short-term becomes short - term). My character vector is full of such words, so I would like a way to undo this distortion.
Let me take an example:
text <- c("Stackoverflow is a great website where you can find great and very skilled people who are so kind to solve your coding problems. In the short-term is a very good thing because you can speed up your research, in the long-term is better if you learn how to code on your own. Let me add more non-sense to make my point. The growth-friendly composition of public finance is a good thing.")
library(textstem)  # provides lemmatize_strings()
ch_vector <- lemmatize_strings(text)
As I said before the outcome is this:
"Stackoverflow be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short - term** be a very good thing because you can speed up your research, in the **long - term** be good if you learn how to code on your own. Let me add much **non - sense** to make my point. The **growth - friendly** composition of public finance be a good thing."
Instead I want this:
"Stackoverflow be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short-term** be a very good thing because you can speed up your research, in the **long-term** be good if you learn how to code on your own. Let me add much **non-sense** to make my point. The **growth-friendly** composition of public finance be a good thing."
So far, I have done it in this way for each word of interest:
ch <- sub(pattern = "growth - friendly", replacement = "growth-friendly", x = ch_vector, fixed = TRUE)
But this is time-consuming and inefficient, and it does not always work (for example, when the capitalization differs).
Can you suggest a better way to do it?
Thanks a lot
x <- "Stackoverflow be a great website where you can find great and very skill people who be so kind to solve your code problem. In the **short - term** be a very good thing because you can speed up your research, in the **long - term** be good if you learn how to code on your own. Let me add much **non - sense** to make my point. The **growth - friendly** composition of public finance be a good thing."
Using gsub() to replace every dash surrounded by spaces with a plain dash seems like it might accomplish what you're after with minimal effort.
gsub(" - ","-",x)
# [1] "Stackoverflow be a great website where you can find great and very skill people
# who be so kind to solve your code problem. In the **short-term** be a very good thing
# because you can speed up your research, in the **long-term** be good if you learn how to
# code on your own. Let me add much **non-sense** to make my point. The
# **growth-friendly** composition of public finance be a good thing."
However, I'm not sure how this interacts with the intended usage of the textstem
package, so it may or may not meet your needs.
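If blanket replacement of every " - " is too broad, a slightly stricter variant of the same gsub() idea is sketched below. It only rejoins a spaced dash that sits directly between two letters, so something like "3 - 5" is left alone, and it works regardless of capitalization and across every element of a character vector. The helper name rejoin_hyphens is made up for illustration and is not part of textstem. The pattern assumes ASCII letters, and it will also join a spaced dash used as punctuation between two words, so check the output on your own data.

```r
# Hypothetical helper (not part of textstem): collapse " - " into "-"
# only when the dash sits directly between two letters.
rejoin_hyphens <- function(x) {
  # perl = TRUE enables lookbehind/lookahead, so the neighbouring letters
  # are checked but not consumed; this lets chains like "a - b - c" be
  # fully rejoined in a single pass.
  gsub("(?<=[[:alpha:]]) - (?=[[:alpha:]])", "-", x, perl = TRUE)
}

lemmas <- c(
  "In the short - term be a very good thing",
  "Growth - friendly composition of public finance",
  "state - of - the - art",
  "a score of 3 - 5"
)

rejoin_hyphens(lemmas)
# [1] "In the short-term be a very good thing"
# [2] "Growth-friendly composition of public finance"
# [3] "state-of-the-art"
# [4] "a score of 3 - 5"
```

Because gsub() is vectorized, the same call can be applied directly to the output of lemmatize_strings(), for example rejoin_hyphens(ch_vector).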