
Tweak a URL regex to capture URLs but not bare IPs


I wrote this regex to capture all kinds of URLs (it really does capture every URL), but it also captures bare IP addresses.

This is my scenario: I have a list full of IPs, hashes and URLs, and my URL regex and my IP regex both capture the same entry. I don't know whether a bare IP can be considered a "URL".

My regex: ((http|https)://)?(www)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,9}\b([-a-zA-Z0-9()@:%_\|+.~#?&//={};,\[\]'"$\x60]*)?

Captures all these:

http://127.0.0.1/
http://127.0.0.1
https://127.0.0.1/m=weblogin/loginform238,363,771,89816356,2167
127.0.0.1:8080 ------> excluding this one is okay too (optional)
127.0.0.1 ------> I want to exclude this one
google.com
google.com:80
www.google.com
https://google.com
https://www.google.com

I want my regex to capture all URLs except bare IPs like this:

127.0.0.1
  • Note: I want to use this in Go code (using Go's regexp engine).
  • Note: I am using the regexp.Compile() and FindAllString functions.

try this regex on regex101


Solution

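  • You can use a regex implementing the "best trick ever" with FindAllStringSubmatch: put what you need to skip in one alternative with no capturing group, and put what you need to keep in another alternative inside a capturing group. Then keep only the matches where Group 1 is non-empty.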

    \b(?:https?://)?(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b(?:[^:]|$)|((?:https?://)?(?:www)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,9}\b[-a-zA-Z0-9()@:%_\|+.~#?&//={};,\[\]'"$\x60]*)
    

    The first alternative is an IP-matching regex to which I added a (?:https?://)? part to match an optional protocol and a (?:[^:]|$) part to make sure the IP pattern is immediately followed by a character other than : or by the end of the string. As written, it also skips IP-only URLs that carry a protocol, such as http://127.0.0.1/, so you may want to adjust this part further.

    Then use it in Go like this:

    package main
    
    import (
        "fmt"
        "regexp"
    )
    
    func main() {
        r := regexp.MustCompile(`\b(?:https?://)?(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b(?:[^:]|$)|((?:https?://)?(?:www)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,9}\b[-a-zA-Z0-9()@:%_\|+.~#?&//={};,\[\]'"$\x60]*)`)
        matches := r.FindAllStringSubmatch(`http://127.0.0.1/
    http://127.0.0.1
    http://www.127.0.0.1/m=weblogin/loginform238,363,771,89816356,2167
    127.0.0.1:8080
    127.0.0.1
    google.com
    google.com:80
    www.google.com
    https://google.com
    https://www.google.com`, -1)
        for _, v := range matches {
            if len(v[1]) > 0 { // if Group 1 matched
                fmt.Println(v[1]) // display it, else do nothing
            }
        }
    }
    

    Output:

    http://www.127.0.0.1/m=weblogin/loginform238,363,771,89816356,2167
    127.0.0.1:8080
    google.com
    google.com:80
    www.google.com
    https://google.com
    https://www.google.com