Regular expression to match all punctuation except that inside of a URL

I'm looking for a regular expression to select all punctuation except for that which is inside of a URL.

If I have the string:

This is a URL: https://test.com/ThisIsAURL !

After removing all matches, it should become:

This is a URL https://test.com/ThisIsAURL

gsub("[[:punct:]]", "", x) removes all punctuation, including punctuation inside URLs. I've tried using negative lookbehinds to exclude punctuation that comes after https, but this was unsuccessful.

In my case, all URLs are Twitter-style short links of the form https://t.co/. They do not end in .com, nor do they have more than one slash-separated path segment (/ThisIsAURL). Ideally, though, I'd like the regex to be as versatile as possible and work on any URL.

Solution

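You can match a URL-like pattern such as https?://\S* and capture it into Group 1, or else match any punctuation, and then replace each match with a backreference to Group 1. When the URL alternative matches, \1 puts the URL back unchanged; when the punctuation alternative matches, Group 1 is empty, so the punctuation is removed: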

x <- "This is a URL: https://test.com/ThisIsAURL !"
trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case=TRUE))
## => [1] "This is a URL https://test.com/ThisIsAURL"
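
Since gsub() is vectorized, the same call can be applied to a whole character vector of tweets. A minimal sketch, assuming a hypothetical helper name strip_punct_keep_urls and made-up sample tweets (neither is part of the question):

```r
# Hypothetical helper wrapping the gsub() call shown above
strip_punct_keep_urls <- function(x) {
  # URLs are captured in Group 1 and written back via \\1;
  # any other run of punctuation is replaced with an empty Group 1
  trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case = TRUE))
}

# Made-up sample input
tweets <- c(
  "Check this out!!! https://t.co/AbC123 #rstats",
  "No link here, just text.",
  "Upper-case scheme: HTTPS://t.co/XyZ"
)

strip_punct_keep_urls(tweets)
## => [1] "Check this out https://t.co/AbC123 rstats"
## => [2] "No link here just text"
## => [3] "Uppercase scheme HTTPS://t.co/XyZ"
```

The third element relies on ignore.case=TRUE to recognize the upper-case scheme. Punctuation that stands alone between two spaces leaves both spaces behind, so a further whitespace cleanup may be wanted in that case.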


The regex is

(https?://\S*)|[[:punct:]]+


Details

(https?://\S*) - Group 1 (referred to as \1 in the replacement pattern):
- https?:// - https:// or http://
- \S* - zero or more non-whitespace characters
| - or
[[:punct:]]+ - one or more punctuation characters (punctuation proper, symbols, and _)
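
One consequence of \S* worth knowing: it runs to the next whitespace, so punctuation glued to the end of a URL (a sentence-final period, a closing parenthesis) is treated as part of the URL and kept. If that matters, one possible variation, offered only as a sketch and not as part of the answer above, is to require the captured URL to end in a character that is neither punctuation nor whitespace. The variable names and sample strings below are made up for illustration:

```r
x <- c("See https://t.co/abc.", "(https://t.co/abc)")

# Pattern from the answer: trailing punctuation stays attached to the URL
original <- "(https?://\\S*)|[[:punct:]]+"
trimws(gsub(original, "\\1", x, ignore.case = TRUE))
## => [1] "See https://t.co/abc." "https://t.co/abc)"

# Assumed variation: the URL must end in a non-punctuation, non-space character,
# so trailing punctuation falls through to the second alternative and is removed
variant <- "(https?://\\S*[^[:punct:][:space:]])|[[:punct:]]+"
trimws(gsub(variant, "\\1", x, ignore.case = TRUE))
## => [1] "See https://t.co/abc" "https://t.co/abc"
```

The trade-off is that a legitimate trailing slash (https://test.com/) is also stripped, which may or may not be acceptable depending on the URLs involved.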