regex, logstash, grok

Regex remove www from URL


I hope someone can help, this is driving me crazy!

I am trying to modify a Logstash Grok filter to parse a domain name. The current regex is \b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b), which correctly extracts the domain. However, I need to add a check that removes the leading www. prefix.

This is what I have come up with so far:

\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(^(?<!www$).*$?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)

With this pattern I only seem to keep the www. part of the domain, not the domain itself. Example of what I need to achieve: www.stackoverflow.com should become stackoverflow.com.

I need to remove specifically www. and not the entire subdomain.

Thank you in advance!

UPDATE

Example input and expected output (using this post as an example). In its current state, https://stackoverflow.com/questions/37070358/ returns www.stackoverflow.com

What I need is for it to return stackoverflow.com


Solution

  • You can add two negative lookaheads, (?!www\.) and (?!http:\/\/www\.), right after the first \b so that a match cannot start at www. or http://www.:

    \b(?!www\.)(?!http:\/\/www\.)(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    

    You can add more negative lookaheads in the same way to exclude https://, ftp:// or ftps:// links.

    ALTERNATIVE:

    \b(?!(?:https?|ftps?):\/\/)(?!www\.)(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)
    

    The (?!(?:https?|ftps?):\/\/) lookahead prevents a match from starting at the protocol, and (?!www\.) prevents one from starting at www., so the first match begins at the domain that follows them.
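
To check how the two patterns behave, here is a minimal sketch using Python's re module as a stand-in for the regex engine used by Grok. That substitution is an assumption: the lookahead syntax used here is common to both, but this has not been run inside Logstash. The names NO_WWW, NO_SCHEME_NO_WWW and first_match are hypothetical helpers made up for this illustration.

```python
import re

# First pattern from the answer: a match may not start at "www." or "http://www."
NO_WWW = re.compile(
    r"\b(?!www\.)(?!http:\/\/www\.)"
    r"(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})"
    r"(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)"
)

# Alternative pattern: a match may not start at a protocol or at "www."
NO_SCHEME_NO_WWW = re.compile(
    r"\b(?!(?:https?|ftps?):\/\/)(?!www\.)"
    r"(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})"
    r"(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)"
)


def first_match(pattern, text):
    """Return the first substring matched by pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Only the leading "www." is dropped; other subdomains are kept.
print(first_match(NO_WWW, "www.stackoverflow.com"))       # stackoverflow.com
print(first_match(NO_WWW, "www.meta.stackoverflow.com"))  # meta.stackoverflow.com

# The first pattern only knows about "http://www.", so a plain "http://"
# URL still yields the protocol name as its first match.
print(first_match(NO_WWW, "http://stackoverflow.com"))    # http

# The alternative skips the protocol as well as "www.".
print(first_match(NO_SCHEME_NO_WWW,
                  "https://www.stackoverflow.com/questions/37070358/"))  # stackoverflow.com
```

Note that later matches in the same string (for example the path segments of a URL) are not suppressed by either pattern; the sketch only looks at the first match.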