
How to identify the bottleneck in slow Filebeat ingestion


I'm using Filebeat to monitor a single rapidly rolling log file. At busy times the file rolls every few seconds; the logging configuration rolls it once it reaches 10MB (between 15k and 35k lines). I have close_timeout set to 5m to avoid keeping too many open file handles when the logs roll rapidly, since held handles stop the rolled files from being deleted and can fill the disk.

My issue is that log lines are clearly being missed (Kibana shows periods with no data from this log, even though I can see activity in the system) and it is also taking a long time for current data to appear in Elasticsearch (up to 15 minutes).

This is my Filebeat configuration:

- type: log
  enabled: true
  fields:
    company.app: "myapp"
    myapp.logstyle: "myappsql"
  paths:
    # Match paths from any environment this filebeat is running on.
    # There is only a single environment and single file being monitored per filebeat process
    - /apps/myapp/*/runtime/logs/dataserver/myfile.log
  close_inactive: 1m
  close_timeout: 5m
  scan_frequency: 1s

output.logstash:
  hosts: [ "mylogstashhost:5046" ]
  compression_level: 3
  bulk_max_size: 4096
  loadbalance: true
  worker: 10

I see this message appearing in the Filebeat log every few seconds:

Closing harvester because close_timeout was reached: /myfile.log

I can also see that there are often many 'open files' and running harvesters even though I'm monitoring just this file:

{
    "monitoring": {
        "metrics": {
            "beat": {
                "cpu": {
                    "system": { "ticks": 14660, "time": { "ms": 20 } },
                    "total": { "ticks": 84710, "time": { "ms": 38 }, "value": 84710 },
                    "user": { "ticks": 70050, "time": { "ms": 18 } }
                },
                "handles": {
                    "limit": { "hard": 1048576, "soft": 1048576 },
                    "open": 38
                },
                "info": {
                    "ephemeral_id": "c885fa62-01b5-4d49-9a80-6acf89058539",
                    "uptime": { "ms": 1410214 }
                },
                "memstats": {
                    "gc_next": 52842240,
                    "memory_alloc": 30249872,
                    "memory_total": 6000680424
                },
                "runtime": { "goroutines": 282 }
            },
            "filebeat": {
                "harvester": { "open_files": 15, "running": 34 }
            },
            "libbeat": {
                "config": { "module": { "running": 0 } },
                "output": { "read": { "bytes": 72 } },
                "pipeline": { "clients": 2, "events": { "active": 4118 } }
            },
            "registrar": {
                "states": { "current": 1458 }
            },
            "system": {
                "load": {
                    "1": 0.01, "15": 0.12, "5": 0.07,
                    "norm": { "1": 0.0006, "15": 0.0075, "5": 0.0044 }
                }
            }
        }
    }
}

The Elastic Monitoring stats show a large number of open handles, which tallies with the log output above. The monitoring also shows that Filebeat goes through bursts of sending events followed by long periods of doing nothing:

[Screenshots: Filebeat event rates; Filebeat CPU and open handles]

Finally, my Logstash logs show nothing other than standard INFO messages, with no warnings about being overwhelmed. Monitoring shows an event latency of 12ms, with occasional small spikes that coincide with each dump of events from Filebeat.

[Screenshot: Logstash monitoring]

While trying to figure out what is happening, I read this:

Filebeat uses a backpressure-sensitive protocol when sending data to Logstash or Elasticsearch to account for higher volumes of data. If Logstash is busy crunching data, it lets Filebeat know to slow down its read. Once the congestion is resolved, Filebeat will build back up to its original pace and keep on shippin'.

However, neither that page nor anything else I have found in the documentation explains how to tell whether backpressure is actually occurring. If it isn't backpressure, what else could I investigate to identify the bottleneck?


Solution

  • At the time of writing, there is no specific logging in Filebeat or Logstash that makes it clear that backpressure is occurring or where it is coming from. As suggested on a few discussion forums, what you need to do is run each component in isolation and benchmark its performance.

    1. Set Filebeat to output to the console and redirect the console output to /dev/null (a config sketch appears after this list). You can then use the Filebeat logs to measure how long Filebeat takes to process a specific file.
    2. Point Filebeat at Logstash, and redirect the Logstash output to /dev/null (see the second sketch below). Again, monitor the Filebeat logs to see how long processing takes. If it is slow at this point, inspect the filters you are using; if your filters are simple or absent, check whether Logstash is CPU or memory constrained. Ideally there should be little or no slowdown compared with redirecting Filebeat straight to /dev/null.
    3. Point Logstash at Elasticsearch and see whether the performance changes. If it does, now is the time to tweak your Logstash pipeline settings, changing batch sizes, worker counts, and so on (see the third sketch below). You may also need to look at Elasticsearch performance and determine whether you are CPU, memory, network, or disk I/O bound.

    In my specific case, a combination of Logstash pipeline configuration changes and extra memory on my Elasticsearch data nodes solved the ingestion problem.
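
    A minimal sketch of the step 1 setup, assuming the filebeat.yml from the question with output.logstash commented out (Filebeat only allows one output to be enabled at a time):

      # filebeat.yml - swap the Logstash output for a console output
      output.console:
        enabled: true

      # Run in the foreground, throw the events away, and keep the logs on
      # stderr so the periodic harvester/pipeline metrics stay visible:
      #   ./filebeat -e -c filebeat.yml > /dev/null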
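
    For step 2, a sketch of a throwaway Logstash pipeline that accepts the Beats input and discards every event. The file name benchmark.conf is made up; the port is taken from the question's output.logstash setting:

      # benchmark.conf
      input {
        beats {
          port => 5046
        }
      }

      # Keep your existing filter {} block here if you want to measure its
      # cost, or omit it to time the bare input-to-output path.

      output {
        # Discard everything. The null output (logstash-output-null) may need
        # to be installed separately; writing to /dev/null with the bundled
        # file output is an alternative.
        null {}
      }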
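
    For step 3, the pipeline settings referred to above live in logstash.yml. The values below are only illustrative starting points to experiment with, not recommendations:

      # logstash.yml
      pipeline.workers: 8        # defaults to the number of CPU cores
      pipeline.batch.size: 1024  # events per worker per batch (default 125)
      pipeline.batch.delay: 50   # ms to wait for a batch to fill (default 50)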