r regex stemming punctuation hyphenation

Removing hyphens in http but preserving hyphenated words in corpus


I am trying to modify a stemming function so that it 1) removes hyphens in http links that appear in the corpus while 2) preserving hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive). I asked a similar question a few months ago in a different thread. The code looks like this:

# load stringr to use str_replace_all
require(stringr)

clean.text = function(x)
{
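  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove digits
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  # remove runs of two or more spaces, pipes, or tabs
  x = gsub("[ |\t]{2,}", "", x)
  # remove a leading and a trailing space
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  # replace anything that is not alphanumeric, whitespace, ' or - with a space
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  return(x)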
}

# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"

However, I did not get a satisfactory answer, and I shifted my attention to other projects before resuming work on this. It appears that the "[^[:alnum:][:space:]'-]" in the last line of the code block is the culprit that also removes - from the non-http part of the corpus.

I could not figure out how to achieve the desired output. It would be much appreciated if someone could offer their insights on this.


Solution

  • The actual culprit is the [[:punct:]] removal pattern, as it matches - anywhere in the string.
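
    This can be checked by running the two suspected patterns on their own. The following is only a minimal sketch; the variable names after_punct and after_last are illustrative and do not come from the question:

    ```r
    x <- "accident-prone"

    # The [[:punct:]] step deletes every punctuation character, the hyphen included.
    after_punct <- gsub("[[:punct:]]", "", x)

    # The last pattern is a negated class that lists "-" as allowed,
    # so it leaves the hyphen untouched.
    after_last <- gsub("[^[:alnum:][:space:]'-]", " ", x)

    print(after_punct)  # [1] "accidentprone"
    print(after_last)   # [1] "accident-prone"
    ```

    By the time the last line of the original function runs, the hyphen is already gone.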

    You may use

    clean.text <- function(x)
    {
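      # remove rt
      x <- gsub("rt\\s", "", x)
      # remove at
      x <- gsub("@\\w+", "", x)
      # remove punctuation, except hyphens between word characters
      x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)
      # remove digits
      x <- gsub("[[:digit:]]+", "", x)
      # remove http
      x <- gsub("http\\w+", "", x)
      # remove runs of two or more horizontal whitespace characters
      x <- gsub("\\h{2,}", "", x, perl=TRUE)
      # trim leading and trailing whitespace
      x <- trimws(x)
      # replace anything that is not alphanumeric, whitespace, ' or - with a space
      x <- gsub("[^[:alnum:][:space:]'-]", " ", x)
      return(x)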
    }
    

    Then,

    my_text <- "  accident-prone  http://www.some.com  rt "
    new_text <- clean.text(my_text)
    new_text 
    ## => [1] "accident-prone"
    

    See the R demo.
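
    One case that may be worth testing on a real corpus is a link that itself contains a hyphen. Because the punctuation step now keeps hyphens between word characters, http\\w+ stops at that hyphen and part of the link can survive. The sketch below compares the order used above with an assumed alternative that removes the whole link first with http\\S+. That alternative is not part of the function above, and the variable names are illustrative:

    ```r
    x <- "accident-prone http://www.some-site.com"

    # Order used above: punctuation first, then http\\w+.
    no_punct <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
    punct_first <- trimws(gsub("http\\w+", "", no_punct))

    # Assumed alternative: remove the whole link first (http\\S+ runs up to the
    # next whitespace), then strip punctuation.
    no_url <- gsub("http\\S+", "", x)
    url_first <- trimws(gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", no_url, perl = TRUE))

    print(punct_first)  # [1] "accident-prone -sitecom"
    print(url_first)    # [1] "accident-prone"
    ```

    If links in the corpus can contain hyphens, moving the link removal ahead of the punctuation step may be the simpler choice.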

    Note:

    • x = gsub("^ ", "", x) and x = gsub(" $", "", x) can be replaced with trimws(x).
    • gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE) removes any punctuation except hyphens between word characters (you may adjust the part before (*SKIP)(*F) to protect other contexts).
    • gsub("[^[:alnum:][:space:]'-]", " ", x) is a base R equivalent of str_replace_all(x, "[^[:alnum:][:space:]'-]", " ").
    • gsub("\\h{2,}", "", x, perl=TRUE) removes any run of 2 or more horizontal whitespace characters. If "[ |\t]{2,}" was meant to match 2 or more whitespace characters of any kind, use \\s instead of \\h here. Note that inside a bracket expression, | is a literal pipe character, not an alternation operator.
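
    The behaviour described in these notes can be tried on a few edge cases. This is a minimal sketch; keep_inner_hyphens is a hypothetical helper name that wraps the pattern from the function above:

    ```r
    # Hypothetical helper: remove punctuation, but skip hyphens that have a
    # word character on both sides.
    keep_inner_hyphens <- function(x) {
      gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
    }

    inner <- keep_inner_hyphens("time-consuming, cost-prohibitive!")
    outer <- keep_inner_hyphens("-leading trailing- a - b")

    print(inner)  # [1] "time-consuming cost-prohibitive"
    print(outer)  # [1] "leading trailing a  b"

    # \\h matches horizontal whitespace only, while \\s also matches newlines.
    y <- "a \n b"
    h_only <- gsub("\\h{2,}", "", y, perl = TRUE)  # no run of 2+ horizontal spaces
    s_any  <- gsub("\\s{2,}", "", y, perl = TRUE)  # " \n " is one run of 3

    print(h_only == y)  # [1] TRUE
    print(s_any)        # [1] "ab"
    ```

    Leading, trailing, and free-standing hyphens are removed like any other punctuation, and only hyphens inside words are kept.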