Tags: regex, url, deep-linking, url-scheme

Extract URL with and without custom URL schemes from text using RegEx


I am looking for a way to extract URLs from text using RegEx. There are plenty of questions and very good answers here on SF, but I did not find a RegEx solution that is also capable of extracting URLs with custom schemes.

Here are a few examples of text I need the URLs extracted from:

Text: Send me a message on whatsapp whatsapp://send?text=Hello+World. I will get in touch!
-> Should extract whatsapp://send?text=Hello+World

Text: Some text google.com
-> Should extract google.com

Text: There are many nice people on https://www.stackoverflow.com
-> Should extract https://www.stackoverflow.com

Text: You can send visit my Facebook profile on fb://myhappyprofile. 
-> Should extract fb://myhappyprofile

Text: https://www.google.com
-> Should extract https://www.google.com

The solutions I have found so far only extract URLs starting with http://, https:// or ://. In those solutions, the protocols have to be listed explicitly within the expression.

The expression that has given me the most results so far is the following:

(http|ftp|https|whatsapp|fb):\/\/([\w_-]+(?:(?:\.[\w_-]+)?))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?


Here I am listing the URL schemes ("deep links") of apps like WhatsApp and Facebook explicitly. Unfortunately, that does not scale very well.

Any help on this would be much appreciated!


Solution

  • If I'm reading this correctly, you want a generic way to detect the protocol used by a URL, so that you don't need to maintain a list of 100 different ones?

    If so, then replacing your protocol list with a generic character class should do the job.

    Assuming that:

    • URLs will always contain "://" to separate the protocol from the location
    • a protocol will be a minimum of 2 characters and a maximum of 20 (though you can adjust that to suit your requirements)

    That would mean that the following should do the job:

    ([a-zA-Z]{2,20}):\/\/([\w_-]+(?:(?:\.[\w_-]+)?))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
    

    https://regex101.com/r/epzXQv/2
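
To check the expression outside regex101, here is a minimal Python sketch that runs it against the sample texts from the question. The pattern is copied unchanged from above; the names `URL_PATTERN` and `extract_urls` are made up for this illustration, and Python is only one possible choice of language.

```python
import re

# The pattern from the answer, unchanged: a generic scheme of 2-20 letters,
# "://", a host part, then an optional path/query part whose last character
# cannot be ".", "," or ":".
URL_PATTERN = re.compile(
    r"([a-zA-Z]{2,20}):\/\/([\w_-]+(?:(?:\.[\w_-]+)?))"
    r"([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)


def extract_urls(text):
    """Return every substring of `text` matched by URL_PATTERN (hypothetical helper)."""
    # finditer + group(0) yields the whole match; findall would instead
    # return tuples of the three capture groups.
    return [match.group(0) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    samples = [
        "Send me a message on whatsapp whatsapp://send?text=Hello+World. I will get in touch!",
        "Some text google.com",
        "There are many nice people on https://www.stackoverflow.com",
        "You can send visit my Facebook profile on fb://myhappyprofile. ",
        "https://www.google.com",
    ]
    for sample in samples:
        print(extract_urls(sample))
```

With these inputs, the four samples that contain a scheme are extracted in full and without the trailing full stop. "Some text google.com" gives an empty list, which follows from the first assumption above: the expression requires "://", so bare domains are not covered by it.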