I am using fluentd to replace logstash. I use the in_tail plugin to tail the nginx access log, whose format is:
log_format main '$remote_addr - $remote_user [$time_local] $request '
'"$status" $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" $request_time';
The fluentd conf is:
format /^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\]\s(?<method>\S+)\s(?<url>\S+)\s(?<http_version>\S+)\s"(?<status>[^\"]+)"\s(?<bytes>\d+)\s"(?<rfc>[^\"]+)"\s"(?<agent>[^\"]+)"\s"(?<x_forward>[^\"]+)"\s(?<time_spent>\S+).*$/
It works fine when the request is well-formed, but it fails when the request is bad, as in the following line:
172.31.33.157 - - [08/May/2017:16:30:20 +0800] - "400" 0 "-" "-" "-" 0.000
The bad request is missing the method and rfc fields, so fluentd fails to parse it. How can I modify the format so that it matches regardless of whether the request is bad or correct?
Any answers will be appreciated.
I ran into another scenario: when the agent or rfc field is empty, parsing fails. For example:
172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 HTTP/1.1 "302" 5 "-" "" "100.38.38.149, 54.224.136.60" 0.004
or
172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 HTTP/1.1 "302" 5 "" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Mobile/14E304" "100.38.38.149, 54.224.136.60" 0.004
How can I handle this scenario?
You may wrap the optional parts of the pattern in optional non-capturing groups, (?:...)?:
^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\](?:\s(?<method>\S+))?(?:\s(?<url>\S+))?\s(?<http_version>\S+)\s"(?<status>[^\"]+)"\s(?<bytes>\d+)(?:\s"(?<rfc>[^\"]+)")?\s"(?<agent>[^\"]+)"\s"(?<x_forward>[^\"]+)"\s(?<time_spent>\S+).*$
See the regex demo
Here, I wrapped the following parts:
(?:\s(?<method>\S+))?
(?:\s(?<url>\S+))?
(?:\s"(?<rfc>[^\"]+)")?
That means each whole sequence, a whitespace character followed by the named capture group pattern, becomes optional.
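Since fluentd's regexp format is a Ruby regular expression, the pattern above can be checked in plain Ruby before it goes into the config. The snippet below is only a sketch: the constant names are made up for illustration, and the input is the bad request line from the question.

```ruby
# Pattern from the answer, copied verbatim (Ruby regexp syntax).
PATTERN = /^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\](?:\s(?<method>\S+))?(?:\s(?<url>\S+))?\s(?<http_version>\S+)\s"(?<status>[^\"]+)"\s(?<bytes>\d+)(?:\s"(?<rfc>[^\"]+)")?\s"(?<agent>[^\"]+)"\s"(?<x_forward>[^\"]+)"\s(?<time_spent>\S+).*$/

# Bad request line from the question (the request part is just "-").
BAD_LINE = '172.31.33.157 - - [08/May/2017:16:30:20 +0800] - "400" 0 "-" "-" "-" 0.000'

m = PATTERN.match(BAD_LINE)

# The optional groups are skipped, so their captures are nil.
puts "method:       #{m[:method].inspect}"
puts "url:          #{m[:url].inspect}"
# http_version is still mandatory, so it picks up the lone "-".
puts "http_version: #{m[:http_version].inspect}"
puts "status:       #{m[:status].inspect}"
```

Note that the lone "-" lands in http_version, because that group is not optional. If you would rather have it empty as well, that group would need to be made optional too.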
Note: when you have more optional fields, you may find that some groups start matching unwanted parts of the input that belong to other groups. In that case, restrict the generic patterns and use optional patterns: replace + with * to match 0 or more occurrences rather than 1 or more, use optional groups as shown above, and make sure you only match the characters/patterns that are expected.
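The difference between + and * on an empty quoted field can be seen in isolation. This is a minimal sketch; the two one-field patterns are hypothetical fragments cut down from the full format for illustration only.

```ruby
# Hypothetical one-field patterns: only the quoted agent field.
PLUS = /^"(?<agent>[^"]+)"$/   # requires at least one character between the quotes
STAR = /^"(?<agent>[^"]*)"$/   # also accepts nothing between the quotes

EMPTY_FIELD = '""'

plus_match = PLUS.match(EMPTY_FIELD)  # no match, so this is nil
star_match = STAR.match(EMPTY_FIELD)  # matches with an empty capture

puts plus_match.inspect
puts star_match[:agent].inspect
```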
See an enhanced pattern below:
^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\](?:\s(?<method>\w+))?(?:\s(?<url>\/\S+))?\s(?<http_version>\S+)\s"(?<status>\d+)"\s(?<bytes>\d+)(?:\s"(?<rfc>[^\"]*)")?(?:\s"(?<agent>[^\"]*)")?\s"(?<x_forward>[^\"]*)"\s(?<time_spent>[\d.]+).*$
See the regex demo.
Some points of interest here:
(?:\s(?<method>\w+))? - here, we only match word chars (\S > \w, you may even consider using [A-Z])
(?:\s(?<url>\/\S+))? - added / since your URLs start with /
(?<status>\d+) - [^\"]+ changed to \d+ (since the status code consists of digits only)
(?:\s"(?<rfc>[^\"]*)")? - the + is changed to * (the value can be empty)
(?:\s"(?<agent>[^\"]*)")? - same here as with rfc, and the group is now optional
\s"(?<x_forward>[^\"]*)" - same as above, + changed to *
(?<time_spent>[\d.]+) - the time_spent value only contains digits and dots.
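As a quick sanity check, the enhanced pattern can be run in plain Ruby against the three problem lines from the question. This is a sketch, not a fluentd test: the constant names and the hash keys (bad, empty_agent, empty_rfc) are made up for illustration.

```ruby
# Enhanced pattern from the answer, copied verbatim.
ENHANCED = /^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\](?:\s(?<method>\w+))?(?:\s(?<url>\/\S+))?\s(?<http_version>\S+)\s"(?<status>\d+)"\s(?<bytes>\d+)(?:\s"(?<rfc>[^\"]*)")?(?:\s"(?<agent>[^\"]*)")?\s"(?<x_forward>[^\"]*)"\s(?<time_spent>[\d.]+).*$/

# The three sample lines from the question.
LINES = {
  bad: '172.31.33.157 - - [08/May/2017:16:30:20 +0800] - "400" 0 "-" "-" "-" 0.000',
  empty_agent: '172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 HTTP/1.1 "302" 5 "-" "" "100.38.38.149, 54.224.136.60" 0.004',
  empty_rfc: '172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 HTTP/1.1 "302" 5 "" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Mobile/14E304" "100.38.38.149, 54.224.136.60" 0.004'
}

# Map each line to its named captures (nil if the line does not match).
results = LINES.transform_values { |line| ENHANCED.match(line)&.named_captures }

results.each do |name, fields|
  puts "#{name}: #{fields.inspect}"
end
```

Fields whose optional group was skipped come back as nil, while fields that were present but empty come back as an empty string, so the two cases stay distinguishable downstream.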