Is there any effective way to remove punctuation in text but keeping hyphenated expressions, such as "accident-prone"?
I used the following function to clean my text
clean.text = function(x)
{
# remove rt
x = gsub("rt ", "", x)
# remove at
x = gsub("@\\w+", "", x)
x = gsub("[[:punct:]]", "", x)
x = gsub("[[:digit:]]", "", x)
# remove http
x = gsub("http\\w+", "", x)
x = gsub("[ |\t]{2,}", "", x)
x = gsub("^ ", "", x)
x = gsub(" $", "", x)
x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
#return(x)
}
and apply it on hyphenated expressions that returned
my_text <- "accident-prone"
new_text <- clean.text(text)
new_text
[1] "accidentprone"
while my desired output is
"accident-prone"
I have referenced this thread but didn't find it worked on my situation. There must be some regex things that I haven't figured out. It will be really appreciated if someone could enlighten me on this.
Putting my two cents in, you could use (*SKIP)(*FAIL)
with perl = TRUE
and remove any non-word characters:
data <- c("my-test of #$%^&*", "accident-prone")
(gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE))
Resulting in
[1] "my-test of" "accident-prone"
(?<![^\\w])[- ](?=\\w)
# a whitespace or a dash between two word characters
# or at the very beginning of the string
let these fail with (*SKIP)(*FAIL)
and put what you want to be removed on the right side of the alternation, in this case
\W+
effectively removing any non-word-characters not between word characters.
You'd need to provide more examples for testing though.