I'm looking for a regular expression to select all punctuation except for that which is inside of a URL.
If I have the string:
This is a URL: https://test.com/ThisIsAURL !
After removing all matches, it should become:
This is a URL https://test.com/ThisIsAURL
gsub("[[:punct:]]", "", x)
removes all punctuation, including the punctuation inside URLs. I've tried using negative lookbehinds to skip punctuation that comes after https, but this was unsuccessful.
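As a quick illustration of the problem, here is a minimal sketch that runs the call above on the sample string (the variable names x and stripped are only for illustration):

```r
x <- "This is a URL: https://test.com/ThisIsAURL !"

# Stripping every punctuation character also strips the ":", "/" and "."
# that belong to the URL, so the link is destroyed.
stripped <- gsub("[[:punct:]]", "", x)
print(stripped)
# "This is a URL httpstestcomThisIsAURL " (note the trailing space)
```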
In the situation I need it for, all URLs are Twitter link-style URLs (https://t.co/). They do not end in .com, nor do they have more than one slash-separated slug (/ThisIsAURL). Ideally, however, I'd like the regex to be as versatile as possible, so that it performs this operation successfully on any URL.
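You may match and capture a URL-like pattern such as https?://\S* into Group 1, match any punctuation with the second alternative, and replace every match with a backreference to Group 1. When the URL alternative matches, \1 restores the URL in the resulting string; when the punctuation alternative matches, Group 1 is empty, so the punctuation is removed: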
x <- "This is a URL: https://test.com/ThisIsAURL !"
trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", x, ignore.case=TRUE))
## => [1] "This is a URL https://test.com/ThisIsAURL"
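Since gsub is vectorized, the same call should also work on a whole character vector. The sketch below assumes a couple of made-up tweets; the t.co slugs are hypothetical and only there for illustration:

```r
# Hypothetical sample tweets; the t.co slugs are invented for this example
tweets <- c(
  "Check this out!!! https://t.co/AbC123xyz",
  "Wow... (really?) http://t.co/Zz9 #cool"
)

# URLs are captured and put back via \\1; every other run of punctuation
# is replaced with the (empty) Group 1 and therefore removed.
clean <- trimws(gsub("(https?://\\S*)|[[:punct:]]+", "\\1", tweets, ignore.case = TRUE))
print(clean)
# clean[1] is "Check this out https://t.co/AbC123xyz"
# clean[2] is "Wow really http://t.co/Zz9 cool"
```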
The regex is
(https?://\S*)|[[:punct:]]+
Details:
(https?://\S*) - Group 1 (referenced with \1 in the replacement pattern):
    https?:// - https:// or http://
    \S* - 0+ non-whitespace chars
| - or
[[:punct:]]+ - 1+ punctuation chars (proper punctuation, symbols and _)
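To see the two alternatives at work, the following sketch lists the individual matches with regmatches and gregexpr. The variable names (pattern, matches, y, kept) are illustrative, not part of the answer's code:

```r
x <- "This is a URL: https://test.com/ThisIsAURL !"
pattern <- "(https?://\\S*)|[[:punct:]]+"

# Every match of the pattern: the URL comes from the first alternative,
# the isolated punctuation runs from the second one.
matches <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))[[1]]
print(matches)
# ":"  "https://test.com/ThisIsAURL"  "!"

# Caveat: \S* runs up to the next whitespace, so punctuation glued to
# the end of a URL is treated as part of the URL and is kept.
y <- "Link: https://t.co/abc, done."
kept <- gsub(pattern, "\\1", y, ignore.case = TRUE)
print(kept)
# "Link https://t.co/abc, done"
```

If trailing punctuation attached to a link matters for your data, the URL part of the pattern would need to be tightened; for whitespace-delimited t.co links the pattern as given should be enough.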