Remove punctuation but keep hyphenated phrases in R text cleaning


Is there an effective way to remove punctuation from text while keeping hyphenated expressions, such as "accident-prone"?

I used the following function to clean my text

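library(stringr)  # provides str_replace_all()

clean.text = function(x)
{
  # remove "rt " (retweet marker)
  x = gsub("rt ", "", x)
  # remove @mentions
  x = gsub("@\\w+", "", x)
  # remove punctuation
  x = gsub("[[:punct:]]", "", x)
  # remove digits
  x = gsub("[[:digit:]]", "", x)
  # remove http links
  x = gsub("http\\w+", "", x)
  # remove runs of two or more spaces, tabs or "|" characters
  x = gsub("[ |\t]{2,}", "", x)
  # trim one leading and one trailing space
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  # replace anything that is not alphanumeric, whitespace, "'" or "-" with a space
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  return(x)
}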

Applying it to a hyphenated expression returns

my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"

while my desired output is

"accident-prone"

I have consulted this thread, but it did not work for my situation. There must be some regex detail that I haven't figured out. I would really appreciate it if someone could enlighten me on this.


Solution

  • Putting my two cents in, you could use (*SKIP)(*FAIL) with perl = TRUE and remove any non-word characters:

    data <- c("my-test of #$%^&*", "accident-prone")
    (gsub("(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+", "", data, perl = TRUE))
    

    Resulting in

    [1] "my-test of"     "accident-prone"
    

    See a demo on regex101.com.


    Here the idea is to match what you want to keep

    (?<![^\\w])[- ](?=\\w)
    # a space or a hyphen between two word characters,
    # or at the very beginning of the string directly before a word character
    

    let these fail with (*SKIP)(*FAIL) and put what you want to be removed on the right side of the alternation, in this case

    \W+
    

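    effectively removing every run of non-word characters, except for a single space or hyphen that sits between two word characters (or at the very beginning of the string, directly before a word character).
    You'd need to provide more examples for testing though.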
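
As a rough way to follow up on the "more examples" remark, here is a minimal sketch that wraps the pattern from the answer in a helper and runs it on a few extra strings. The function name `clean_keep_hyphens`, its `replacement` argument, and the last two sample strings are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): wraps the answer's
# pattern so it can be tried on more inputs.
# `replacement` is an assumption added for illustration: "" mirrors the
# answer, " " keeps neighbouring words apart.
clean_keep_hyphens <- function(x, replacement = "") {
  # Left branch: a space or hyphen that follows a word character (or the
  # start of the string) and precedes a word character is skipped, i.e. kept.
  # Right branch: every other run of non-word characters is replaced.
  pattern <- "(?<![^\\w])[- ](?=\\w)(*SKIP)(*FAIL)|\\W+"
  out <- gsub(pattern, replacement, x, perl = TRUE)
  # Trim the ends: the pattern protects a space at the very start of the
  # string, and a " " replacement can leave a space at the end.
  trimws(out)
}

samples <- c(
  "accident-prone",
  "my-test of #$%^&*",
  "well-known, state-of-the-art!",
  "trailing- hyphen"
)

as_in_answer <- clean_keep_hyphens(samples)
with_spaces  <- clean_keep_hyphens(samples, replacement = " ")

print(as_in_answer)
print(with_spaces)
```

With the empty replacement, the third sample comes out as "well-knownstate-of-the-art", because the comma and the following space are removed as one run and the two words get glued together. Replacing with a space instead gives "well-known state-of-the-art", which is probably closer to what a text-cleaning step wants. In the question's `clean.text`, the hyphen is already gone after the `[[:punct:]]` step, so the later `str_replace_all` call has nothing left to preserve.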