How to write a fluentd regexp that also handles nginx bad requests


I use fluentd in place of logstash, with the in_tail plugin tailing the nginx access log. The access log's format is:

log_format  main  '$remote_addr - $remote_user [$time_local] $request '
'"$status" $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" $request_time';

The format in the fluentd conf is:

format /^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\]\s(?<method>\S+)\s(?<url>\S+)\s(?<http_version>\S+)\s"(?<status>[^\"]+)"\s(?<bytes>\d+)\s"(?<rfc>[^\"]+)"\s"(?<agent>[^\"]+)"\s"(?<x_forward>[^\"]+)"\s(?<time_spent>\S+).*$/

It works fine when the request is well-formed, but it fails when the request is bad, as in the following line:

172.31.33.157 - - [08/May/2017:16:30:20 +0800] - "400" 0 "-" "-" "-" 0.000

The bad request is missing the method and rfc fields, so fluentd fails to parse the line. How can I modify the format so that it matches whether the request is bad or correct?

Any answers will be appreciated.

I have run into another scenario: when the agent or rfc field is empty, parsing fails as well. For example:

172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 HTTP/1.1 "302" 5 "-" "" "100.38.38.149, 54.224.136.60" 0.004

or

172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 HTTP/1.1 "302" 5 "" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Mobile/14E304" "100.38.38.149, 54.224.136.60" 0.004

How can I handle this scenario?


Solution

  • You may wrap the parts of the pattern that are optional within optional non-capturing groups, (?:...)?:

    ^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\](?:\s(?<method>\S+))?(?:\s(?<url>\S+))?\s(?<http_version>\S+)\s"(?<status>[^\"]+)"\s(?<bytes>\d+)(?:\s"(?<rfc>[^\"]+)")?\s"(?<agent>[^\"]+)"\s"(?<x_forward>[^\"]+)"\s(?<time_spent>\S+).*$
    


    Here, I wrapped the following parts:

    (?:\s(?<method>\S+))?
    (?:\s(?<url>\S+))?
    (?:\s"(?<rfc>[^\"]+)")?
    

    That makes each whole sequence optional: the leading whitespace together with the named capture group that follows it.
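
To sanity-check the effect of the optional groups outside fluentd, here is a minimal Python sketch. It is a port made under assumptions, not fluentd itself: Python's re module spells named groups (?P<name>...) instead of the (?<name>...) syntax used in the fluentd format, and the names ORIGINAL, WRAPPED and BAD_REQUEST are made up for illustration. It runs the bad-request line from the question against the original format and against the version with the optional groups.

```python
import re

# Python port of the original format from the question.
# Assumption: (?<name>...) is rewritten as (?P<name>...) for Python's re module.
ORIGINAL = re.compile(
    r'^(?P<host>\S+)\s-\s(?P<user>\S+)\s\[(?P<time>[^\]]*)\]'
    r'\s(?P<method>\S+)\s(?P<url>\S+)\s(?P<http_version>\S+)'
    r'\s"(?P<status>[^"]+)"\s(?P<bytes>\d+)'
    r'\s"(?P<rfc>[^"]+)"\s"(?P<agent>[^"]+)"\s"(?P<x_forward>[^"]+)"'
    r'\s(?P<time_spent>\S+).*$'
)

# Same format with method, url and rfc wrapped in optional non-capturing groups.
WRAPPED = re.compile(
    r'^(?P<host>\S+)\s-\s(?P<user>\S+)\s\[(?P<time>[^\]]*)\]'
    r'(?:\s(?P<method>\S+))?(?:\s(?P<url>\S+))?\s(?P<http_version>\S+)'
    r'\s"(?P<status>[^"]+)"\s(?P<bytes>\d+)'
    r'(?:\s"(?P<rfc>[^"]+)")?\s"(?P<agent>[^"]+)"\s"(?P<x_forward>[^"]+)"'
    r'\s(?P<time_spent>\S+).*$'
)

# Bad-request line copied from the question.
BAD_REQUEST = '172.31.33.157 - - [08/May/2017:16:30:20 +0800] - "400" 0 "-" "-" "-" 0.000'

original_match = ORIGINAL.match(BAD_REQUEST)
wrapped_match = WRAPPED.match(BAD_REQUEST)

print(original_match)             # the original format does not match
print(wrapped_match.groupdict())  # the wrapped format does
```

In this sketch the lone - of the bad request ends up in http_version, while method and url stay unset (None in Python).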

    Note: when you have more optional fields, you may find that some groups start matching parts of the input that belong to other groups. In that case, restrict the generic patterns and make them optional where needed: replace + with * to match 0 or more occurrences rather than 1 or more, use optional groups as shown above, and make sure you only match the characters/patterns that are expected.
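
The shifting described in the note can be reproduced on a small fragment. The following Python sketch is illustrative only: it uses an unanchored two-field fragment rather than the full format, Python's (?P<name>...) group syntax, and made-up names (TAIL, plus_version, star_version). With + the empty rfc value cannot match, so the search slides forward and the groups capture values that belong to the following fields; with * each group stays on its own field.

```python
import re

# Tail of a log line whose first quoted field (rfc) is empty.
TAIL = '"" "Mozilla/5.0" "100.38.38.149"'

# Assumption: unanchored fragments, written with Python's (?P<name>...) syntax.
plus_version = re.compile(r'"(?P<rfc>[^"]+)"\s"(?P<agent>[^"]+)"')
star_version = re.compile(r'"(?P<rfc>[^"]*)"\s"(?P<agent>[^"]*)"')

shifted = plus_version.search(TAIL).groupdict()
aligned = star_version.search(TAIL).groupdict()

print(shifted)  # rfc holds the agent value, agent holds the x_forward value
print(aligned)  # rfc is empty, agent holds the agent value
```

The full format is anchored with ^ and $, so there a mismatch like this usually shows up as a failed parse rather than shifted values, but the cause is the same.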

    See an enhanced pattern below:

    ^(?<host>\S+)\s-\s(?<user>\S+)\s\[(?<time>[^\]]*)\](?:\s(?<method>\w+))?(?:\s(?<url>\/\S+))?\s(?<http_version>\S+)\s"(?<status>\d+)"\s(?<bytes>\d+)(?:\s"(?<rfc>[^\"]*)")?(?:\s"(?<agent>[^\"]*)")?\s"(?<x_forward>[^\"]*)"\s(?<time_spent>[\d.]+).*$
    


    Some points of interest here:

    • (?:\s(?<method>\w+))? - here, we only match word chars (\S is replaced with \w; you may even consider using [A-Z])
    • (?:\s(?<url>\/\S+))? - added / since your URLs start with /
    • (?<status>\d+) - [^\"]+ changed to \d+ (since the status code consists of digits only)
    • (?:\s"(?<rfc>[^\"]*)")? - the + is changed to * (the value can be empty)
    • (?:\s"(?<agent>[^\"]*)")? - same here as with rfc, and the field is now wrapped in an optional group
    • \s"(?<x_forward>[^\"]*)" - the + is changed to * here as well
    • (?<time_spent>[\d.]+) - the time_spent value only contains digits and dots.
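
As a quick check, the enhanced pattern can be run against the three sample lines from the question. This is a minimal Python sketch, not fluentd itself: the pattern is ported to Python's (?P<name>...) named-group syntax, and ENHANCED, SAMPLES and the sample keys are made-up names.

```python
import re

# Python port of the enhanced pattern.
# Assumption: (?<name>...) is rewritten as (?P<name>...) for Python's re module.
ENHANCED = re.compile(
    r'^(?P<host>\S+)\s-\s(?P<user>\S+)\s\[(?P<time>[^\]]*)\]'
    r'(?:\s(?P<method>\w+))?(?:\s(?P<url>/\S+))?\s(?P<http_version>\S+)'
    r'\s"(?P<status>\d+)"\s(?P<bytes>\d+)'
    r'(?:\s"(?P<rfc>[^"]*)")?(?:\s"(?P<agent>[^"]*)")?\s"(?P<x_forward>[^"]*)"'
    r'\s(?P<time_spent>[\d.]+).*$'
)

# Log lines copied from the question.
SAMPLES = {
    "bad_request": '172.31.33.157 - - [08/May/2017:16:30:20 +0800] - "400" 0 "-" "-" "-" 0.000',
    "empty_agent": '172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 '
                   'HTTP/1.1 "302" 5 "-" "" "100.38.38.149, 54.224.136.60" 0.004',
    "empty_rfc": '172.31.44.196 - - [08/May/2017:18:47:31 +0800] GET /click?mb_pl=ios&version=1.1 '
                 'HTTP/1.1 "302" 5 "" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) '
                 'AppleWebKit/603.1.30 (KHTML, like Gecko) Mobile/14E304" '
                 '"100.38.38.149, 54.224.136.60" 0.004',
}

results = {name: ENHANCED.match(line) for name, line in SAMPLES.items()}

for name, match in results.items():
    # Groups that did not take part in the match are reported as None.
    print(name, match.groupdict() if match else None)
```

All three lines match in this sketch: the bad request leaves method and url unset, and the empty agent or rfc values come back as empty strings.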