My services run on GKE, and I use the EFK stack for logging. Each node has a Fluent Bit pod deployed by a DaemonSet, and there is a Fluentd aggregator pod. This setup worked well at first, but the Fluent Bit pods are now failing: they keep logging errors and restarting.
What is the cause of this error and how can I fix it?
Logs from fluent-bit:
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072144.132487045.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.865639031.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.703709663.flb' to 4096 bytes
[2023/07/18 08:08:22] [ info] [storage] ver=1.3.0, type=memory+filesystem, sync=full, checksum=off, max_chunks_up=128
[2023/07/18 08:08:22] [ info] [storage] backlog input plugin: storage_backlog.1
[2023/07/18 08:08:22] [ info] [cmetrics] version=0.5.7
[2023/07/18 08:08:22] [ info] [ctraces ] version=0.2.5
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] initializing
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/07/18 08:08:22] [error] [sqldb] error=disk I/O error
[2023/07/18 08:08:22] [error] [input:tail:tail.0] db: could not create 'in_tail_files' table
[2023/07/18 08:08:22] [error] [input:tail:tail.0] could not open/create database
[2023/07/18 08:08:22] [error] failed initialize input tail.0
[2023/07/18 08:08:22] [error] [engine] input initialization failed
[2023/07/18 08:08:22] [error] [lib] backend failed
Events of fluent-bit:
> kubectl describe po fluent-bit-xmkj6
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 51m (x1718 over 6d3h) kubelet Pulling image "cr.fluentbit.io/fluent/fluent-bit:2.0.5"
Warning BackOff 96s (x43323 over 6d3h) kubelet Back-off restarting failed container
fluent-bit.conf:
[SERVICE]
    Daemon                Off
    Flush                 1
    Log_Level             info
    storage.path          /fluent-bit/buffer/
    storage.sync          full
    storage.checksum      off
    Parsers_File          parsers.conf
    Parsers_File          custom_parsers.conf
    HTTP_Server           On
    HTTP_Listen           0.0.0.0
    HTTP_Port             2020
    Health_Check          On

[INPUT]
    Name                  tail
    Path                  /var/log/containers/*.log
    db                    /fluent-bit/buffer/logs.db
    multiline.parser      docker, cri
    Tag                   kube.*
    Skip_Long_Lines       On
    Skip_Empty_lines      On

[FILTER]
    Name                  kubernetes
    Match                 kube.**
    Kube_URL              https://kubernetes.default.svc.cluster.local:443
    Kube_Tag_Prefix       kube.var.log.containers.
    Merge_Log             On
    Keep_Log              Off
    Annotations           Off
    K8S-Logging.Parser    On
    K8S-Logging.Exclude   On

[FILTER]
    Name                  rewrite_tag
    Log_Level             debug
    Match                 kube.**
    Rule                  $kubernetes['labels']['type'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['labels']['type'].$kubernetes['container_name'] false
    Emitter_Name          re_emitted_type
    Emitter_Storage.type  filesystem

[FILTER]
    Name                  rewrite_tag
    Log_Level             debug
    Match                 kube.**
    Rule                  $kubernetes['container_name'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['container_name'] false
    Emitter_Name          re_emitted_no_type
    Emitter_Storage.type  filesystem

[OUTPUT]
    Name                  forward
    Match                 *
    Retry_Limit           False
    Workers               1
    Host                  172.32.20.10
    Port                  30006
This error has two possible causes.
The disk space is actually exhausted:
You can check how much disk space and how many inodes are left by running df on the node:
# Check disk usage
df -h
# Check inode usage
df -ih
If you find that disk space or inodes are under pressure, free up space on the node, for example by cleaning up old container logs and unused images, or move to nodes with a larger disk.
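For example, you can check whether the kubelet is reporting disk pressure and which directories are the biggest consumers (the node name and paths below are illustrative):
# From your workstation: is the node under DiskPressure?
kubectl describe node <node-name> | grep -i diskpressure
# On the node: largest directories under /var (container logs and images usually live here)
sudo du -xh -d 1 /var | sort -h | tail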
inotify resources are exhausted:
If you have enough disk space left on the node but are still getting the "no space left on device" error, it is very likely that inotify resources are exhausted.
Anything that follows log files, such as Fluent Bit's tail input or kubectl logs -f, uses inotify to monitor file changes, and each watched file consumes an inotify watch.
Linux limits the number of inotify watches per user. You can check the current limit by reading the fs.inotify.max_user_watches kernel parameter:
$ sudo sysctl fs.inotify.max_user_watches
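You can also list the whole fs.inotify subtree to see the watch, instance, and queued-event limits together:
# Prints fs.inotify.max_queued_events, max_user_instances and max_user_watches
sudo sysctl fs.inotify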
You can check how many inotify watches are consumed by each process on the node with the following one-liner:
echo -e "COUNT\tPID\tUSER\tCOMMAND" ; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t" ; ps -p $PID -o user=,command=; done }
The above command will find large consumers of inotify watches.
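As a quick, non-persistent test, you can raise the limit once on an affected node and check whether the Fluent Bit pod stops crash-looping (the value resets when the node reboots or is recreated):
# Temporarily raise the per-user inotify watch limit on this node
sudo sysctl -w fs.inotify.max_user_watches=524288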
There are a few options to mitigate the issue.
You can deploy a DaemonSet that raises the inotify watch limit on your cluster's nodes. This should be safe from a node-stability perspective. The container simply re-applies the sysctl in a loop:
command:
- /bin/sh
- -c
- |
  while true; do
    sysctl -w fs.inotify.max_user_watches=524288
    sleep 10
  done
imagePullPolicy: IfNotPresent
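For reference, a minimal DaemonSet sketch built around that snippet could look like the following. The name and the busybox image are just placeholders; the container has to be privileged so that the sysctl write is allowed and applies to the node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: inotify-watches-bump       # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: inotify-watches-bump
  template:
    metadata:
      labels:
        app: inotify-watches-bump
    spec:
      containers:
      - name: sysctl
        image: busybox:1.36        # any small image with a shell and sysctl works
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true         # needed so /proc/sys is writable from the pod
        command:
        - /bin/sh
        - -c
        - |
          while true; do
            sysctl -w fs.inotify.max_user_watches=524288
            sleep 10
          done
After it rolls out, sysctl fs.inotify.max_user_watches on any node should report 524288, and if inotify exhaustion was the cause, the fluent-bit pods should stop crash-looping.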