I've been capturing web logs using Logstash. Specifically, I'm trying to capture the request URLs and also split them into their component parts.
If I take an example log entry URL:
"GET https://www.stackoverflow.com:443/some/link/here.html HTTP/1.1"
I use this grok pattern:
\"(?:%{NOTSPACE:http_method}|-)(?:%{SPACE}http://)?(?:%{SPACE}https://)?(%{NOTSPACE:http_site}:)?(?:%{NUMBER:http_site_port:int})?(?:%{GREEDYDATA:http_site_url})? (?:%{WORD:http_type|-}/)?(?:%{NOTSPACE:http_version:float})?(?:%{SPACE})?\"
I get this:
{
  "http_method": [
    [
      "GET"
    ]
  ],
  "SPACE": [
    [
      " ",
      null,
      ""
    ]
  ],
  "http_site": [
    [
      "www.stackoverflow.com"
    ]
  ],
  "BASE10NUM": [
    [
      "443"
    ]
  ],
  "http_site_url": [
    [
      "/some/link/here.html"
    ]
  ],
  "http_type": [
    [
      "HTTP"
    ]
  ]
}
The trouble is, I'm trying to ALSO capture the entire URL:
https://www.stackoverflow.com:443/some/link/here.html
So in total, I'm seeking 4 separate outputs:
http_site_complete: https://www.stackoverflow.com:443/some/link/here.html
http_site: www.stackoverflow.com
http_site_port: 443
http_site_url: /some/link/here.html
Is there some way to do this?
First, look at the built-in patterns for dealing with URLs. Putting something like URIHOST in your pattern will be easier to read and maintain than a bunch of WORDs or NOTSPACEs.
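Second, once you have lots of little fields, you can always use logstash's filters to manipulate them. For example, a mutate filter can join them back together (your pattern doesn't capture the scheme, so you'd need to grab that into a field of its own and prepend it as well):
mutate {
  # Rebuild host:port/path from the fields grok already extracted
  add_field => { "http_site_complete" => "%{http_site}:%{http_site_port}%{http_site_url}" }
}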
Or you could get fancy with your regexp and use a named group:
(?<total>%{WORD:wordOne} %{WORD:wordTwo} %{WORD:wordThree})
which would individually capture three fields and make one more field from the whole string.
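Applied to the request line in your question, that named-group idea might look roughly like the sketch below. It leans on the stock URIPROTO, IPORHOST, POSINT, URIPATH, URIPARAM, WORD and NUMBER patterns, assumes the log line arrives in the message field, and reuses your field names. I haven't run it against your real logs, so treat it as a starting point and check it in a grok debugger first.

```
filter {
  grok {
    # Single-quoted so the literal double quotes in the log line need no escaping.
    # The outer named group (http_site_complete) keeps the whole URL, while the
    # patterns nested inside it still produce their own fields.
    match => {
      "message" => '"%{WORD:http_method} (?<http_site_complete>%{URIPROTO}://%{IPORHOST:http_site}(?::%{POSINT:http_site_port:int})?(?<http_site_url>%{URIPATH}(?:%{URIPARAM})?)?) %{WORD:http_type}/%{NUMBER:http_version:float}"'
    }
  }
}
```

For the example entry this should give http_site_complete as the full https://www.stackoverflow.com:443/some/link/here.html, alongside http_site, http_site_port and http_site_url. Unlike your original pattern it doesn't allow for "-" placeholders or a missing scheme, so you'd have to add those alternatives back if your logs contain them.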