I am trying to modify a text-cleaning function so that it 1) removes hyphens inside http links (which appear in the corpus) while 2) preserving hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive, etc.). I asked a similar question a few months ago in a different thread. The code looks like this:
# load stringr to use str_replace_all
require(stringr)
clean.text = function(x)
{
# remove rt
x = gsub("rt ", "", x)
# remove at
x = gsub("@\\w+", "", x)
x = gsub("[[:punct:]]", "", x)
x = gsub("[[:digit:]]", "", x)
# remove http
x = gsub("http\\w+", "", x)
x = gsub("[ |\t]{2,}", "", x)
x = gsub("^ ", "", x)
x = gsub(" $", "", x)
x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
#return(x)
}
# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"
However, I did not get a satisfactory answer, and I shifted my attention to other projects before resuming work on this. It appears that the "[^[:alnum:][:space:]'-]" pattern in the last line of the code block is the culprit, since it also removes - from the non-http part of the corpus.
I could not figure out how to achieve the desired output. I would appreciate any insights on this.
The actual culprit is the [[:punct:]] removal pattern, as it matches - anywhere in the string.
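A quick check makes this visible. The following is a minimal sketch in base R that applies the two patterns from the question, one at a time, to the sample string; the variable names step_punct and step_last are only illustrative.

```r
# The hyphen is a member of the POSIX [[:punct:]] class, so this step strips it
step_punct <- gsub("[[:punct:]]", "", "accident-prone")
print(step_punct)
# [1] "accidentprone"

# The final pattern lists - inside the negated class, so the hyphen is left alone
step_last <- gsub("[^[:alnum:][:space:]'-]", " ", "accident-prone")
print(step_last)
# [1] "accident-prone"
```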
You may use
clean.text <- function(x)
{
# remove rt
x <- gsub("rt\\s", "", x)
# remove at
x <- gsub("@\\w+", "", x)
x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)
x <- gsub("[[:digit:]]+", "", x)
# remove http
x <- gsub("http\\w+", "", x)
x <- gsub("\\h{2,}", "", x, perl=TRUE)
x <- trimws(x)
x <- gsub("[^[:alnum:][:space:]'-]", " ", x)
return(x)
}
Then,
my_text <- " accident-prone http://www.some.com rt "
new_text <- clean.text(my_text)
new_text
## => [1] "accident-prone"
See the R demo.
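One thing that may be worth checking: punctuation is stripped before the http step, so a link that itself contains a hyphen (say http://my-site.com) appears to leave a fragment such as -sitecom behind, because http\\w+ stops at the preserved hyphen. A possible variant, shown only as a sketch, removes links first while they are still intact. The name clean.text2, the http\\S+ pattern, the \\brt\\b token match, and the single-space replacement are all assumptions made for illustration, not part of the function above.

```r
# Hypothetical variant: drop links before touching punctuation
clean.text2 <- function(x) {
  # remove whole links while they are still intact
  x <- gsub("http\\S+", "", x)
  # remove rt only as a standalone token
  x <- gsub("\\brt\\b", "", x, perl = TRUE)
  # remove @mentions
  x <- gsub("@\\w+", "", x)
  # remove punctuation but keep hyphens between word characters
  x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
  # remove digits
  x <- gsub("[[:digit:]]+", "", x)
  # collapse whitespace runs to one space so neighbouring words stay apart
  x <- gsub("\\s{2,}", " ", x)
  trimws(x)
}

clean.text2(" accident-prone http://my-site.com/a-b rt ")
# [1] "accident-prone"
```

Removing links first means the hyphen rule only ever sees ordinary text, so the two requirements no longer interfere with each other.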
Note:
x = gsub("^ ", "", x) and x = gsub(" $", "", x) can be replaced with trimws(x), which strips all leading and trailing whitespace rather than a single space.
gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE) removes any punctuation except hyphens between word characters (you may adjust this further in the part before (*SKIP)(*F)).
gsub("[^[:alnum:][:space:]'-]", " ", x) is a base R equivalent of str_replace_all(x, "[^[:alnum:][:space:]'-]", " ").
gsub("\\h{2,}", "", x, perl=TRUE) removes any run of 2 or more horizontal whitespace characters. If by "[ |\t]{2,}" you meant to match any 2 or more whitespace characters, use \\s instead of \\h here. Note that | inside a bracket expression is a literal pipe, not alternation.
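The patterns in the notes above can be tried in isolation. This is a small sketch on made-up strings (x, y and the result names are illustrative). It uses " " as the replacement in the whitespace examples so that the difference between \\h and \\s is easy to see, which differs from the "" replacement used in the function.

```r
x <- "cost-prohibitive -- really?! - yes"

# \b-\b(*SKIP)(*F) matches a hyphen between word characters and then discards
# that match, so only the remaining punctuation is removed
kept <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
print(kept)
# [1] "cost-prohibitive  really  yes"

y <- "a  b\n\nc"

# \h covers horizontal whitespace only, so the newline run is untouched
h_only <- gsub("\\h{2,}", " ", y, perl = TRUE)

# \s covers any whitespace, so the newline run is collapsed as well
s_all <- gsub("\\s{2,}", " ", y, perl = TRUE)
print(s_all)
# [1] "a b c"
```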