I hope someone can help, this is driving me crazy!
I am attempting to modify Logstash Grok filters to parse a domain name.
Currently the regex is:
\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
and it correctly separates the domain. However, I need to add an additional check to remove www.
This is what I have come up with so far:
\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(^(?<!www$).*$?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(\.?|\b)
I can only seem to keep the www. part of the domain, and not the domain itself.
Example of what I need to achieve:
www.stackoverflow.com should be stackoverflow.com.
I need to remove specifically www. and not the entire subdomain.
Thank you in advance!
UPDATE
Example inputs to expected outputs (using this post as an example):
In its current state:
https://stackoverflow.com/questions/37070358/ returns www.stackoverflow.com
What I need is for it to return stackoverflow.com
You can add the negative lookaheads (?!www\.) and (?!http:\/\/www\.) right after the first \b so that a match cannot start at www. or at http://www.:
\b(?!www\.)(?!http:\/\/www\.)(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
See the regex demo
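As a quick sanity check, the pattern above can be exercised outside Logstash. The following is a minimal sketch in Python; the names HOSTNAME_NO_WWW and first_host are made up for this illustration, and Python's re module stands in for the Oniguruma engine that Grok uses, so the result is worth re-checking in Logstash itself.

```python
import re

# Pattern copied from the answer: the two lookaheads right after the
# first \b stop a match from starting at "www." or at "http://www.".
HOSTNAME_NO_WWW = re.compile(
    r"\b(?!www\.)(?!http:\/\/www\.)"
    r"(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})"
    r"(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*"
    r"(?:\.?|\b)"
)


def first_host(text):
    """Return the first match of the pattern in text, or None (hypothetical helper)."""
    match = HOSTNAME_NO_WWW.search(text)
    return match.group(0) if match else None


print(first_host("www.stackoverflow.com"))
print(first_host("http://www.stackoverflow.com/questions/37070358/"))
print(first_host("meta.stackoverflow.com"))
print(first_host("https://www.stackoverflow.com/"))
```

The first two calls print stackoverflow.com, and the third keeps the full meta.stackoverflow.com because only a leading www. is rejected. The last call prints https, since https:// is not covered by these two lookaheads.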
You may add more negative lookaheads to exclude https:// or ftp/ftps links.
ALTERNATIVE:
\b(?!(?:https?|ftps?):\/\/)(?!www\.)(?:[0-9A-Za-z][0-9A-Za-z-]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,62}))*(?:\.?|\b)
See this regex demo
The (?!(?:https?|ftps?):\/\/) and (?!www\.) lookaheads will just let you skip the protocol and www parts of the URLs.
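To illustrate how the alternative pattern skips both the protocol and a leading www., here is a small Python sketch. LABEL, HOST_SKIP_SCHEME_AND_WWW and extract_host are names invented for this example, and Python's re module is only an approximation of the engine behind Grok, so treat it as a check of the idea rather than of Logstash's exact behavior.

```python
import re

# One hostname label, as in the original pattern.
LABEL = r"[0-9A-Za-z][0-9A-Za-z-]{0,62}"

# Alternative pattern from the answer, assembled from LABEL for readability.
# A match may not start at a scheme such as http://, https://, ftp:// or
# ftps://, nor at a leading "www.".
HOST_SKIP_SCHEME_AND_WWW = re.compile(
    r"\b(?!(?:https?|ftps?):\/\/)(?!www\.)"
    rf"(?:{LABEL})(?:\.(?:{LABEL}))*(?:\.?|\b)"
)


def extract_host(url):
    """Return the first hostname-like match in url, or None (hypothetical helper)."""
    match = HOST_SKIP_SCHEME_AND_WWW.search(url)
    return match.group(0) if match else None


for url in (
    "https://www.stackoverflow.com/questions/37070358/",
    "ftp://www.example.org",
    "ftps://files.example.org/pub",
):
    print(url, "->", extract_host(url))
```

With these inputs the sketch yields stackoverflow.com, example.org and files.example.org: the engine keeps moving to the next word boundary until neither lookahead rejects the starting position.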