regex, url, href, pcre

RegEx - phishing attempts in HTML


I need your help.
What I want:
Match an <a> tag only if its link text (url.text) and its href (url.href) both contain a URL and the two domains are not equal (ignoring protocol and subdomains).

It should work like this:

<a href="http://www.test1.net/dir1/index.html" target="_blank">test1.net/admin</a> <-- NOT MATCH
<a href="https://test2.com">THIS SITE</a> <-- NOT MATCH
<a href="https://subdomain.test3.org">test2.org</a> <-- MATCH
<a href="http://www2.test4.com" target="_blank">https://global.test4.com/index.html</a> <-- NOT MATCH
<a href="http://eu.test5.com">https://evil.com/eu.test5.com/</a> <-- MATCH
<a href="http://eu.site6.com/index.html" target="_blank">https: // eu. evil. com</a> <-- MATCH
<a href="https://site7.com/">http://www.site7.com/123/test</a> <-- NOT MATCH

I started writing something like this, but my expression does the opposite of what I want.
Please help me figure out how to get the behavior shown above.


Solution

  • Your original expression is pretty well designed, but I would add a negative lookahead such as:

    (?!.*\1.*)
    

    or:

    (?!((?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?(\1)).*)
    

    inside it, to skip links whose url.text contains the same domain as the href, maybe with an expression similar to:

    (?i)<a\s+href="(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?"[^>]*>(?!.*\1.*)(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?.*?<\/a>
    

    or, probably more accurately, with:

    (?i)<a\s+href="(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?"[^>]*>(?!((?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?(\1)).*)(?:https?:\s*\/\/\s*)?(?:\s*w{3}\.\s*)?(?:[^"\/]*\.\s*)?([a-z0-9_-]+\s*\.\s*[a-z0-9_-]{2,6}\s*)(\/[^"]*)?.*?<\/a>
    

    You will most likely want to modify this expression and adjust its boundaries. For instance, you can add \s* anywhere you want to allow spaces, or use a bounded quantifier such as \s{0,5} to cap how many spaces are allowed.

    Demo


    If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com. In the demo you can also watch how it matches against some sample inputs.
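
As a rough way to check the second expression outside regex101, the sketch below runs it unchanged through Python's `re` module against the seven sample links from the question. The names `PHISHING_LINK`, `SAMPLES`, and `is_suspicious` are made up for this illustration, and it assumes Python's `re` handles the constructs used here (inline `(?i)`, a backreference inside a negative lookahead) the same way PCRE does.

```python
import re

# The second expression from the answer, copied verbatim and split across
# adjacent raw strings only for readability. PHISHING_LINK is a hypothetical name.
PHISHING_LINK = re.compile(
    r'(?i)<a\s+href="(?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?'
    r'([a-z0-9_-]+\.[a-z0-9_-]{2,6})(\/[^"]*)?"[^>]*>'
    r'(?!((?:https?:\/\/)?(?:w{3}\.)?(?:[^"\/]*\.)?(\1)).*)'
    r'(?:https?:\s*\/\/\s*)?(?:\s*w{3}\.\s*)?(?:[^"\/]*\.\s*)?'
    r'([a-z0-9_-]+\s*\.\s*[a-z0-9_-]{2,6}\s*)(\/[^"]*)?.*?<\/a>'
)

# The sample anchors from the question, paired with the outcome the asker wants
# (True means the link should be flagged).
SAMPLES = [
    ('<a href="http://www.test1.net/dir1/index.html" target="_blank">test1.net/admin</a>', False),
    ('<a href="https://test2.com">THIS SITE</a>', False),
    ('<a href="https://subdomain.test3.org">test2.org</a>', True),
    ('<a href="http://www2.test4.com" target="_blank">https://global.test4.com/index.html</a>', False),
    ('<a href="http://eu.test5.com">https://evil.com/eu.test5.com/</a>', True),
    ('<a href="http://eu.site6.com/index.html" target="_blank">https: // eu. evil. com</a>', True),
    ('<a href="https://site7.com/">http://www.site7.com/123/test</a>', False),
]


def is_suspicious(anchor_html: str) -> bool:
    """Return True if the expression finds a link whose text names another domain."""
    return PHISHING_LINK.search(anchor_html) is not None


for html, expected in SAMPLES:
    verdict = "MATCH" if is_suspicious(html) else "NOT MATCH"
    wanted = "MATCH" if expected else "NOT MATCH"
    print(f"{verdict:9} (wanted {wanted}): {html}")
```

Each sample is tested on its own line here because `[^>]*` and `[^"\/]*` can cross newlines, so running the expression over a whole HTML document may behave differently. The first, simpler expression with `(?!.*\1.*)` rejects a link as soon as the href domain appears anywhere after the opening tag, which would leave the `evil.com/eu.test5.com/` sample unflagged.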