r regex text-mining

Regular expression to match all punctuation except that inside of a URL


I'm looking for a regular expression to select all punctuation except for that which is inside of a URL.

If I have the string:

This is a URL: https://test.com/ThisIsAURL !

And remove all matches it should become:

This is a URL https://test.com/ThisIsAURL

gsub("[[:punct:]]", "", x) removes all punctuation, including the punctuation inside URLs. I've tried using negative lookbehinds to exclude punctuation that follows https, but without success.

In my case, all URLs are Twitter-style short links (https://t.co/). They do not end in .com, and they have no more than one slash-separated slug (/ThisIsAURL). Ideally, however, I'd like the regex to be as versatile as possible, so that it works on any URL.


Solution

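  • You can match a URL-like pattern such as https?://\S* and capture it into Group 1, match any run of punctuation with a second alternative, and replace every match with a backreference to Group 1. A URL match is restored by \1, while a punctuation match leaves Group 1 empty and is therefore removed: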

    x <- "This is a URL: https://test.com/ThisIsAURL !"
    trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case=TRUE))
    ## => [1] "This is a URL https://test.com/ThisIsAURL"
    


    The regex is

    (https?://\S*)|[[:punct:]]+
    


    Details

    • (https?://\S*) - Group 1 (referenced with \1 in the replacement pattern):
      • https?:// - https:// or http://
      • \S* - 0+ non-whitespace chars
    • | - or
    • [[:punct:]]+ - 1+ punctuation (proper punctuation, symbols and _)
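
One consequence of \S* is that punctuation glued to the end of a URL (for example a sentence-final period) is treated as part of the URL and kept. The sketch below wraps the pattern in a helper and adds an optional variant that requires the URL to end in a non-punctuation character. The function name strip_punct_keep_urls, the trim_url_tail argument, the sample tweets, and the use of perl = TRUE are all assumptions made for illustration and are not part of the original answer. The variant would also drop a trailing slash from a URL such as https://test.com/, so whether it is appropriate depends on the data.

```r
# Hypothetical helper (not from the original answer): strip punctuation
# from each element of a character vector while leaving URLs intact.
strip_punct_keep_urls <- function(x, trim_url_tail = FALSE) {
  pattern <- if (trim_url_tail) {
    # Assumption: a URL ends in a non-punctuation, non-space character,
    # so a trailing "." or "," falls to the punctuation alternative.
    "(https?://\\S*[^[:punct:][:space:]])|[[:punct:]]+"
  } else {
    # Pattern from the answer above.
    "(https?://\\S*)|[[:punct:]]+"
  }
  # perl = TRUE is a choice made here for predictable backtracking;
  # the original call uses R's default regex engine.
  trimws(gsub(pattern, "\\1", x, ignore.case = TRUE, perl = TRUE))
}

# Made-up sample input; gsub() is vectorised over x.
tweets <- c(
  "This is a URL: https://test.com/ThisIsAURL !",
  "Wow!!! Check https://t.co/AbC123 #rstats",
  "See https://t.co/AbC123."
)

strip_punct_keep_urls(tweets)
## => [1] "This is a URL https://test.com/ThisIsAURL"
## => [2] "Wow Check https://t.co/AbC123 rstats"
## => [3] "See https://t.co/AbC123."

strip_punct_keep_urls(tweets, trim_url_tail = TRUE)[3]
## => [1] "See https://t.co/AbC123"
```

Because gsub() is vectorised, the same call works on a whole column of tweets without an explicit loop.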