Tags: java, regex, url-validation

How can I validate a URL (domain) that allows wildcards (*, %) in Java?


I want to validate URLs in Java while allowing wildcard characters.

I found some nice examples of URL validation in Java (regex, UrlValidator), but none of them accept a wildcard character.

Here's what I'm practicing:

CODE(urlValidator)

public void urlValidiTest(){
    System.out.println(this.urlCheck("https://www.google.com"));
    System.out.println(this.urlCheck("https://google.com"));
    System.out.println(this.urlCheck("*.com"));
}

public boolean urlCheck(String url){
    return new UrlValidator().isValid(url);
}

OUTPUT

true

true

false

CODE(regex)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public void regexTest() {
    String[] URLs = new String[] { "http://www.google.com", "http://google.com/", "*.com" };
    Pattern REGEX = Pattern.compile("(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$");
    for (String url : URLs) {
        Matcher matcher = REGEX.matcher(url);
        if (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

RESULT

http://www.google.com

http://google.com/

What I want is for all of the URLs above to be treated as valid.

How should I approach this problem?

Any comment would be appreciated. Thanks.

UPDATES

Following the answer, I removed the scheme part and added the wildcard alternatives |\* and |\.\* to the domain part. Written with a single backslash in a Java string they give me an error (invalid escape sequence; valid ones are \b \t \n \f \r \" \' \\), so I escaped them as \\*, but I'm not sure the changes are right.

Now it doesn't allow "https://www.google.com", but it allows the others ("www.google.com", "google.com", "*.google.com", "*.com"):

 public void regexValidator(String str){

    Pattern REGEX = Pattern.compile(""
            + "(?i)^(?:\\S+(?::\\S*)?@)"
            + "?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)"
            + "(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])"
            + "(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|"

            //DOMAIN
            + "(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+|\\*)"
            + "(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*"
            //

            + "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)"
            + "(?::\\d{2,5})?(?:[/?#]\\S*)?$");

    Matcher _matcher = REGEX.matcher(str);
    if(_matcher.find()){
        System.out.println("[O] " + str);
    }
    else {
        System.out.println("[X] " + str);
    }
}

public void validate(){
    System.out.println("TEST START");
    this.regexValidator("https://www.google.com");
    this.regexValidator("www.google.com");
    this.regexValidator("google.com");
    this.regexValidator("*.google.com");
    this.regexValidator("*.com");
    System.out.println("DONE");
}

TEST START

[X] https://www.google.com

[O] www.google.com

[O] google.com

[O] *.google.com

[O] *.com

DONE

Any help would be appreciated. Thanks.


Solution

  • Take this with a grain of salt: I don't have access to Java right now and did this off the top of my head, so if there are errors in here, feel free to correct me.

    You need to update your regex to include wildcards. That's not trivial, considering how complex that thing is.

    Let's first break down the regex you have:

    (?i)
    ^
        (?:
            (?:
                https?|ftp
            )
            ://
        )
        (?:
            \S+
            (?:
                :\S*
            )?
            @
        )?
        (?:
            (?!
                (?:
                    10|127
                )
                (?:
                    \.\d{1,3}
                ){3}
            )
            (?!
                (?:
                    169\.254|192\.168
                )
                (?:
                    \.\d{1,3}
                ){2}
            )
            (?!
                172\.
                (?:
                    1[6-9]|2\d|3[0-1]
                )
                (?:
                    \.\d{1,3}
                ){2}
            )
            (?:
                [1-9]\d?|1\d\d|2[01]\d|22[0-3]
            )
            (?:
                \.
                (?:
                    1?\d{1,2}|2[0-4]\d|25[0-5]
                )
            ){2}
            (?:
                \.
                (?:
                    [1-9]\d?|1\d\d|2[0-4]\d|25[0-4]
                )
            )
            |
            (?:
                (?:
                    [a-z\u00a1-\uffff0-9]-*
                )*
                [a-z\u00a1-\uffff0-9]+
            )
            (?:
                \.
                (?:
                    [a-z\u00a1-\uffff0-9]-*
                )*
                [a-z\u00a1-\uffff0-9]+
            )*
            (?:
                \.
                (?:
                    [a-z\u00a1-\uffff]{2,}
                )
            )
            \.?
        )
        (?:
            :\d{2,5}
        )?
        (?:
            [/?#]\S*
        )?
    $
    

    We can now see that there is a group for the scheme, a group for the username/password pair (the one with the @ character), a big group for the host itself, a group for the port, and one for an optional path, query or fragment. The big group can be broken down into two parts separated by | (OR). The first part is for IP addresses, with negative lookaheads that reject private and loopback ranges (10.x, 127.x, 169.254.x, 192.168.x and 172.16.x to 172.31.x). The second part is for named domains, consisting of one or more labels separated by dots, followed by the TLD.
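
To make that breakdown easier to edit in Java, the same pattern can be assembled from named string constants, one per group. This is only a sketch: the class name UrlRegexParts and the constant names are made up for illustration, and the pattern text itself is copied unchanged from the question.

```java
import java.util.regex.Pattern;

final class UrlRegexParts {
    // Scheme: http, https or ftp followed by "://"
    static final String SCHEME = "(?:(?:https?|ftp)://)";
    // Optional "user@" or "user:password@"
    static final String USER_INFO = "(?:\\S+(?::\\S*)?@)?";
    // IPv4 address; the three lookaheads reject private and loopback ranges
    static final String IP = "(?!(?:10|127)(?:\\.\\d{1,3}){3})"
            + "(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})"
            + "(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})"
            + "(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])"
            + "(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}"
            + "(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))";
    // One domain label (letters, digits, inner hyphens)
    static final String LABEL = "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+";
    // Dot plus a TLD of at least two letters
    static final String TLD = "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))";
    // Named domain: label, further dot-separated labels, TLD, optional trailing dot
    static final String NAME = "(?:" + LABEL + ")(?:\\." + LABEL + ")*" + TLD + "\\.?";
    // Optional port and optional path, query or fragment
    static final String PORT = "(?::\\d{2,5})?";
    static final String PATH = "(?:[/?#]\\S*)?";

    static final Pattern URL = Pattern.compile(
            "(?i)^" + SCHEME + USER_INFO + "(?:" + IP + "|" + NAME + ")" + PORT + PATH + "$");

    static boolean isValid(String url) {
        return URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"http://www.google.com", "http://8.8.8.8", "http://10.0.0.1", "*.com"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Split up like this, each wildcard change described below touches a single constant instead of one very long string literal.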

    So what do you need to do to allow wildcards? Add a wildcard character (* or %) in each group that you want to allow to be replaced by a wildcard:

    If you want to allow a wildcard for the scheme, add one here:

        (?:
            (?:
                https?|ftp
                |\*    <-----
            )
            ://
        )
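
As a rough Java sketch of the scheme change alone (the class and method names are hypothetical): note that the regex \* has to be written as "\\*" inside a Java string literal, because a single backslash before * is not a valid Java escape sequence.

```java
import java.util.regex.Pattern;

final class SchemeWildcard {
    // Scheme group with "*" added as a fourth alternative.
    // The regex token \* is written "\\*" in Java source.
    static final Pattern SCHEME = Pattern.compile("(?i)(?:(?:https?|ftp|\\*)://)");

    // True when the input starts with http://, https://, ftp:// or *://
    static boolean hasAllowedScheme(String url) {
        return SCHEME.matcher(url).lookingAt();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"https://google.com", "*://google.com", "gopher://google.com"}) {
            System.out.println(s + " -> " + hasAllowedScheme(s));
        }
    }
}
```

Here lookingAt() only checks the start of the input, so the rest of the URL is deliberately left unvalidated in this fragment.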
    

    If you want to allow wildcards for the username and/or password parts, you don't need to do anything: your regex already allows any non-whitespace characters there, so *:*@ or *@ are already valid.

    If you want to allow wildcards for the domain name, add them here:

            (?:
                (?:
                    [a-z\u00a1-\uffff0-9]-*
                )*
                [a-z\u00a1-\uffff0-9]+
                |\*    <-----
            )
            (?:
                \.
                (?:
                    [a-z\u00a1-\uffff0-9]-*
                )*
                [a-z\u00a1-\uffff0-9]+
                |\.\*    <-----
            )*
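
A minimal Java sketch of the two domain-name additions, assuming a scheme-less host string such as the ones in the question's update (the class name and the isValidHost helper are hypothetical, and the IP branch is left out to keep it short):

```java
import java.util.regex.Pattern;

final class DomainWildcard {
    // One domain label, copied from the question's regex
    static final String LABEL = "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+";

    static final Pattern HOST = Pattern.compile("(?i)^"
            + "(?:" + LABEL + "|\\*)"                // first label, or "*"
            + "(?:\\." + LABEL + "|\\.\\*)*"         // further ".label" or ".*" parts
            + "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))"  // TLD, no wildcard allowed here
            + "\\.?$");

    static boolean isValidHost(String host) {
        return HOST.matcher(host).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"www.google.com", "*.google.com", "*.com", "www.*.com", "*google.com", "google.*"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValidHost(s));
        }
    }
}
```

With this pattern a wildcard can only stand in for a whole label, so *google.com is rejected, and google.* is rejected because the TLD group is unchanged.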
    

    If you want to allow a wildcard for the TLD, add one here:

            (?:
                \.
                (?:
                    [a-z\u00a1-\uffff]{2,}
                    |\*    <-----
                )
            )
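
The TLD change can be tried in isolation with a sketch like the following, again on scheme-less host names only; TldWildcard and isValidName are illustrative names, not part of the question.

```java
import java.util.regex.Pattern;

final class TldWildcard {
    // One domain label, copied from the question's regex
    static final String LABEL = "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+";
    // TLD group with "*" added as an alternative to the two-or-more-letter TLD
    static final String TLD = "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}|\\*))";

    static final Pattern NAME = Pattern.compile(
            "(?i)^(?:" + LABEL + ")(?:\\." + LABEL + ")*" + TLD + "\\.?$");

    static boolean isValidName(String host) {
        return NAME.matcher(host).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"google.com", "google.*", "www.google.*", "google.c", "*.com"}) {
            System.out.println(s + " -> " + isValidName(s));
        }
    }
}
```

Only the TLD position accepts * here, so *.com is still rejected until the domain-name groups are changed as well.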
    

    If you want to allow a wildcard for the port, add one here:

        (?:
            :\d{2,5}
            |:\*    <-----
        )?
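
For the port group on its own, a small Java sketch (PortWildcard and isValidPortSuffix are hypothetical names) that checks just the ":port" suffix might look like this:

```java
import java.util.regex.Pattern;

final class PortWildcard {
    // Optional port: a colon followed by 2 to 5 digits, or a colon followed by "*"
    static final Pattern PORT = Pattern.compile("(?::\\d{2,5}|:\\*)?");

    // Checks only the port suffix, for example ":8080", ":*" or the empty string
    static boolean isValidPortSuffix(String suffix) {
        return PORT.matcher(suffix).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {":8080", ":*", "", ":8", ":*80"}) {
            System.out.println("\"" + s + "\" -> " + isValidPortSuffix(s));
        }
    }
}
```

The empty string is accepted because the whole group stays optional, and ":8" is rejected because the original pattern requires at least two digits.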
    

    If you want to allow a wildcard for paths, you don't need to do anything either: your regex already covers it (/* and /*/*/foobar etc. are already valid).

    And last, but not least, if you want to allow wildcards for scheme and domain name together (like in your example), you need to add a new group and OR it in:

        |
        (?:
            \*
            \.
            (?:
                [a-z\u00a1-\uffff]{2,}
            )
        )
        (?:
            :\d{2,5}
        )?
        (?:
            [/?#]\S*
        )?
    

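    Add that after the last group and before the $ symbol. Because | has the lowest precedence in a regex, wrap the original body and the new branch together in one non-capturing group, ^(?:original|new)$, so that ^ and $ apply to both alternatives. Don't forget to add a wildcard to the TLD and/or the port here as well, if you want that.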
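
Here is a rough Java sketch of that last step, under a few loud assumptions: the user-info group and the IP branch are left out for brevity, the class and constant names are invented, and the (?:\.label)* part inside the wildcard branch is an addition of mine so that *.google.com is accepted too (the branch exactly as written above only covers *.tld). It also contrasts the two alternatives wrapped in one group with the same alternatives left ungrouped.

```java
import java.util.regex.Pattern;

final class WildcardUrlCheck {
    static final String LABEL = "(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+";
    static final String TLD = "(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))";
    static final String PORT = "(?::\\d{2,5})?";
    static final String PATH = "(?:[/?#]\\S*)?";

    // Branch 1: scheme plus named domain (user info and IP branch omitted in this sketch)
    static final String FULL = "(?:(?:https?|ftp)://)"
            + "(?:" + LABEL + ")(?:\\." + LABEL + ")*" + TLD + "\\.?" + PORT + PATH;
    // Branch 2: "*" in place of scheme and first label; the (?:\.label)* part is an assumption
    static final String WILDCARD = "(?:\\*(?:\\." + LABEL + ")*" + TLD + ")" + PORT + PATH;

    // Both branches inside one group, so ^ and $ apply to each of them
    static final Pattern GROUPED = Pattern.compile("(?i)^(?:" + FULL + "|" + WILDCARD + ")$");
    // Without the group, ^ belongs only to branch 1 and $ only to branch 2
    static final Pattern UNGROUPED = Pattern.compile("(?i)^" + FULL + "|" + WILDCARD + "$");

    static boolean isValid(String url) {
        return GROUPED.matcher(url).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"https://www.google.com", "*.com", "*.google.com", "www.google.com"}) {
            System.out.println(s + " -> " + isValid(s));
        }
        String junk = "https://www.google.com and more";
        System.out.println("grouped:   " + isValid(junk));
        System.out.println("ungrouped: " + UNGROUPED.matcher(junk).find());
    }
}
```

The ungrouped pattern reports a match for the string with trailing text when used with find(), as in the question's code, which is why the extra group matters. Scheme-less names such as www.google.com are still rejected here, since only the wildcard branch drops the scheme.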