Tags: postgresql, docker, zabbix, debian-jessie, disk-io

Zabbix slow disk request responses


I have installed Zabbix server 5.0 in a Docker container, following the official guide (for PostgreSQL): https://www.zabbix.com/documentation/current/manual/installation/containers

After that, I installed Zabbix agent 5.0 for Debian 8 on the machine that hosts the Docker daemon (and the dockerized Zabbix server), from the official source: https://repo.zabbix.com/zabbix/5.0/debian/pool/main/z/zabbix-release/zabbix-release_5.0-1+jessie_all.deb

Once I had configured the server and the agent (using the template "Template OS Linux by Zabbix agent"), I started getting good data from the agent, but I immediately got a problem on the "physical" server (it's a VPS):

vda: Disk read/write request responses are too high (read > 20 ms for 15m or write > 20 ms for 15m)

I checked the disk with the nmon command and saw spikes of 100% disk usage, but they never last more than 5 seconds; the rest of the time the disk stays at 5-15% usage.

The server only runs Docker, the Zabbix server with Postgres, and the Zabbix agent. Nothing more. (I got the server because it was an old, unused VPS.)

Could you help me:

  1. verify whether this is a real problem and not a false positive?
  2. verify whether Zabbix alone is causing this issue?
  3. find out whether there is a solution to this problem?

Thanks in advance


Solution

  • The solution is on the Zabbix forum: you just need to change the default thresholds in the Zabbix templates.

    Data collection -> Templates (Configuration -> Templates in Zabbix 5.0)
    Find the template in use ("Linux by Zabbix agent", named "Template OS Linux by Zabbix agent" in 5.0) and edit it.
    Template -> Macros -> Inherited and template macros
    {$VFS.DEV.READ.AWAIT.WARN}  : replace 20 with 35
    {$VFS.DEV.WRITE.AWAIT.WARN} : replace 20 with 50
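
To see what raising these macros changes, here is a minimal Python sketch of the check described by the problem name ("read > 20 ms for 15m or write > 20 ms for 15m"). It assumes the trigger fires only when every sample in the 15-minute window exceeds its threshold, which is what the "for 15m" wording suggests; the actual trigger expression in your template is the authority. The function name and the sample values are hypothetical.

```python
def await_trigger_fires(read_ms, write_ms, read_warn=20.0, write_warn=20.0):
    """Return True if all read samples or all write samples in the window
    exceed their threshold (assumed "min over the window > threshold" logic)."""
    # min(...) > threshold is equivalent to "every sample is above the threshold".
    read_high = bool(read_ms) and min(read_ms) > read_warn
    write_high = bool(write_ms) and min(write_ms) > write_warn
    return read_high or write_high


# Hypothetical 15-minute window of await samples, in milliseconds.
reads = [24.0, 31.5, 22.1, 27.8]
writes = [41.0, 38.2, 45.9, 36.4]

print(await_trigger_fires(reads, writes))              # default thresholds: 20 / 20
print(await_trigger_fires(reads, writes, 35.0, 50.0))  # raised thresholds: 35 / 50
```

Under this assumption, a single sample below the threshold within the window is enough to keep the trigger quiet, so short spikes alone would not raise the problem; a sustained latency above 20 ms would.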
    

    These thresholds necessarily depend on your hardware. A distributed infrastructure with Ceph and Proxmox will by definition have poorer I/O performance than a standalone server. I don't know whether you can adapt these thresholds per group of machines, but I suppose so...
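
Before raising the thresholds, it can help to check the latency on the host itself, independently of Zabbix. The following is a rough Python sketch, not the agent's own implementation: it reads the cumulative counters in /proc/diskstats twice and divides the time spent on I/O by the number of completed requests, which is how iostat-style "await" is usually derived. The device name "vda" comes from the question; the function names and the 60-second interval are assumptions.

```python
import time


def parse_diskstats(text, device):
    """Return (reads, ms_reading, writes, ms_writing) for one device,
    taken from the contents of /proc/diskstats."""
    for line in text.splitlines():
        fields = line.split()
        # Layout: major minor name reads merged sectors ms_reading
        #         writes merged sectors ms_writing ...
        if len(fields) >= 11 and fields[2] == device:
            return (int(fields[3]), int(fields[6]),
                    int(fields[7]), int(fields[10]))
    raise ValueError(f"device {device!r} not found")


def await_ms(prev, curr):
    """Average read and write latency in ms between two samples.
    Returns None for a direction with no completed requests."""
    d_reads = curr[0] - prev[0]
    d_writes = curr[2] - prev[2]
    read_await = (curr[1] - prev[1]) / d_reads if d_reads > 0 else None
    write_await = (curr[3] - prev[3]) / d_writes if d_writes > 0 else None
    return read_await, write_await


def sample(device="vda", interval=60.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and return the await pair."""
    with open(path) as f:
        prev = parse_diskstats(f.read(), device)
    time.sleep(interval)
    with open(path) as f:
        curr = parse_diskstats(f.read(), device)
    return await_ms(prev, curr)
```

Calling `sample("vda", 60)` a few times while the problem is active gives numbers to compare against the 20 ms default. If they stay well above it even when the disk is mostly idle, the alert reflects the real latency of the VPS storage rather than a false positive.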