I am looking for a way to extract URLs from text using RegEx. There are plenty of questions and very good answers here on SF, but I did not find a RegEx solution that can also extract URLs with custom schemes.
Here are a few examples of text from which I need the URLs extracted:
Text: Send me a message on whatsapp whatsapp://send?text=Hello+World. I will get in touch!
-> Should extract whatsapp://send?text=Hello+World
Text: Some text google.com
-> Should extract google.com
Text: There are many nice people on https://www.stackoverflow.com
-> Should extract https://www.stackoverflow.com
Text: You can send visit my Facebook profile on fb://myhappyprofile.
-> Should extract fb://myhappyprofile
Text: https://www.google.com
-> Should extract https://www.google.com
The solutions I found so far explicitly extracted URLs starting with http:// https:// or ://. In those solutions, the protocols had to be specified within the expression.
The expression that gave me the best results so far is the following:
(http|ftp|https|whatsapp|fb):\/\/([\w_-]+(?:(?:\.[\w_-]+)?))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Here I am listing the URL schemes ("deep links") of apps like WhatsApp and Facebook explicitly. Unfortunately, that does not scale very well.
Any help on this would be much appreciated!
If I'm reading this correctly, you want a generic way to detect the protocol used in the URL, so that you don't need to maintain a list of 100 different ones?
If so, then replacing your protocol list with a generic character class should do the job.
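Assuming that a protocol name consists only of letters and is between 2 and 20 characters long (which is what the first group of the expression encodes), the following should do the job: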
([a-zA-Z]{2,20}):\/\/([\w_-]+(?:(?:\.[\w_-]+)?))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
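To check the expression against the examples from the question, here is a minimal Python sketch. The pattern is copied unchanged from the line above; the names `URL_PATTERN` and `extract_urls` are made up for this illustration, and the question does not say which regex engine is in use, so treat Python's `re` module as an assumption.

```python
import re

# Pattern from the answer, unchanged: a 2-20 letter scheme followed by "://".
URL_PATTERN = re.compile(
    r"([a-zA-Z]{2,20}):\/\/([\w_-]+(?:(?:\.[\w_-]+)?))"
    r"([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)


def extract_urls(text):
    """Return the full text of every match of URL_PATTERN, in order.

    finditer() with group(0) is used because findall() would return
    tuples of the three capture groups instead of the whole URL.
    """
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Sample texts taken from the question.
samples = [
    "Send me a message on whatsapp whatsapp://send?text=Hello+World. I will get in touch!",
    "Some text google.com",
    "There are many nice people on https://www.stackoverflow.com",
    "You can send visit my Facebook profile on fb://myhappyprofile.",
    "https://www.google.com",
]

for sample in samples:
    print(extract_urls(sample))
```

With these samples, the custom-scheme URLs come out without the trailing sentence period, because the last character class of the pattern does not include `.`. Note that the bare `google.com` example is not matched, since the expression requires `://`. Also, real scheme names may contain digits, `+`, `-` and `.` after the first letter (RFC 3986), so a scheme like that would need a wider first group than `[a-zA-Z]{2,20}`.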