My services run on GKE, and I use the EFK stack for logging. Each node has a Fluent Bit pod deployed by a DaemonSet, and there is a Fluentd aggregator pod. This setup worked well at first, but the Fluent Bit pods are now failing: they keep logging errors and restarting.
What is the cause of this error and how can I fix it?
Logs from fluent-bit:
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072144.132487045.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.865639031.flb' to 4096 bytes
[lib/chunkio/src/cio_file_unix.c:528 errno=28] No space left on device
[lib/chunkio/src/cio_file.c:1116 errno=28] No space left on device
[2023/07/18 08:08:22] [error] [storage] cannot adjust chunk size '/fluent-bit/buffer//emitter.3/1-1689072143.703709663.flb' to 4096 bytes
[2023/07/18 08:08:22] [ info] [storage] ver=1.3.0, type=memory+filesystem, sync=full, checksum=off, max_chunks_up=128
[2023/07/18 08:08:22] [ info] [storage] backlog input plugin: storage_backlog.1
[2023/07/18 08:08:22] [ info] [cmetrics] version=0.5.7
[2023/07/18 08:08:22] [ info] [ctraces ] version=0.2.5
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] initializing
[2023/07/18 08:08:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2023/07/18 08:08:22] [error] [sqldb] error=disk I/O error
[2023/07/18 08:08:22] [error] [input:tail:tail.0] db: could not create 'in_tail_files' table
[2023/07/18 08:08:22] [error] [input:tail:tail.0] could not open/create database
[2023/07/18 08:08:22] [error] failed initialize input tail.0
[2023/07/18 08:08:22] [error] [engine] input initialization failed
[2023/07/18 08:08:22] [error] [lib] backend failed
Events of fluent-bit:
> kubectl describe po fluent-bit-xmkj6
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 51m (x1718 over 6d3h) kubelet Pulling image "cr.fluentbit.io/fluent/fluent-bit:2.0.5"
Warning BackOff 96s (x43323 over 6d3h) kubelet Back-off restarting failed container
fluent-bit.conf:
[SERVICE]
    Daemon                Off
    Flush                 1
    Log_Level             info
    storage.path          /fluent-bit/buffer/
    storage.sync          full
    storage.checksum      off
    Parsers_File          parsers.conf
    Parsers_File          custom_parsers.conf
    HTTP_Server           On
    HTTP_Listen           0.0.0.0
    HTTP_Port             2020
    Health_Check          On

[INPUT]
    Name                  tail
    Path                  /var/log/containers/*.log
    db                    /fluent-bit/buffer/logs.db
    multiline.parser      docker, cri
    Tag                   kube.*
    Skip_Long_Lines       On
    Skip_Empty_lines      On

[FILTER]
    Name                  kubernetes
    Match                 kube.**
    Kube_URL              https://kubernetes.default.svc.cluster.local:443
    Kube_Tag_Prefix       kube.var.log.containers.
    Merge_Log             On
    Keep_Log              Off
    Annotations           Off
    K8S-Logging.Parser    On
    K8S-Logging.Exclude   On

[FILTER]
    Name                  rewrite_tag
    Log_Level             debug
    Match                 kube.**
    Rule                  $kubernetes['labels']['type'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['labels']['type'].$kubernetes['container_name'] false
    Emitter_Name          re_emitted_type
    Emitter_Storage.type  filesystem

[FILTER]
    Name                  rewrite_tag
    Log_Level             debug
    Match                 kube.**
    Rule                  $kubernetes['container_name'] ^(.*)$ dev.service.$kubernetes['namespace_name'].$kubernetes['container_name'] false
    Emitter_Name          re_emitted_no_type
    Emitter_Storage.type  filesystem

[OUTPUT]
    Name                  forward
    Match                 *
    Retry_Limit           False
    Workers               1
    Host                  172.32.20.10
    Port                  30006
This error has two possible causes.
The disk space is actually exhausted:
You can check how much disk space and how many inodes are left by running df on the node:
# Check disk usage
df -h
# Check inode usage
df -ih
If you find that disk space or inodes are under pressure, free up space on the node, for example by cleaning up old container logs and unused images, or move to nodes with a larger disk.
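For example, you can check whether the kubelet is reporting disk pressure and which directories are the biggest consumers (the node name and paths below are illustrative):
# From your workstation: is the node under DiskPressure?
kubectl describe node <node-name> | grep -i diskpressure
# On the node: largest directories under /var (container logs and images usually live here)
sudo du -xh -d 1 /var | sort -h | tail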
inotify resources are exhausted:
If you have enough disk space left on the node but are still getting the "no space left on device" error, it is very likely that inotify resources are exhausted.
Anything that follows log files, such as Fluent Bit's tail input or kubectl logs -f, uses inotify to monitor file changes, and each watched file consumes an inotify watch.
Linux limits the number of inotify watches per user. You can check the current limit by reading the fs.inotify.max_user_watches kernel parameter:
$ sudo sysctl fs.inotify.max_user_watches
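You can also list the whole fs.inotify subtree to see the watch, instance, and queued-event limits together:
# Prints fs.inotify.max_queued_events, max_user_instances and max_user_watches
sudo sysctl fs.inotify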
You can check how many inotify watches are consumed by each process on the node with the following one-liner:
echo -e "COUNT\tPID\tUSER\tCOMMAND" ; sudo find /proc/[0-9]*/fdinfo -type f 2>/dev/null | sudo xargs grep ^inotify 2>/dev/null | cut -d/ -f 3 | uniq -c | sort -nr | { while read -rs COUNT PID; do echo -en "$COUNT\t$PID\t" ; ps -p $PID -o user=,command=; done }
The above command will find large consumers of inotify watches.
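As a quick, non-persistent test, you can raise the limit once on an affected node and check whether the Fluent Bit pod stops crash-looping (the value resets when the node reboots or is recreated):
# Temporarily raise the per-user inotify watch limit on this node
sudo sysctl -w fs.inotify.max_user_watches=524288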
There are a few options to mitigate the issue.
You can deploy a DaemonSet that raises the inotify watch limit on your cluster's nodes. This should be safe from a node-stability perspective. The container simply re-applies the sysctl in a loop:
command:
- /bin/sh
- -c
- |
  while true; do
    sysctl -w fs.inotify.max_user_watches=524288
    sleep 10
  done
imagePullPolicy: IfNotPresent
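For reference, a minimal DaemonSet sketch built around that snippet could look like the following. The name and the busybox image are just placeholders; the container has to be privileged so that the sysctl write is allowed and applies to the node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: inotify-watches-bump       # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: inotify-watches-bump
  template:
    metadata:
      labels:
        app: inotify-watches-bump
    spec:
      containers:
      - name: sysctl
        image: busybox:1.36        # any small image with a shell and sysctl works
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true         # needed so /proc/sys is writable from the pod
        command:
        - /bin/sh
        - -c
        - |
          while true; do
            sysctl -w fs.inotify.max_user_watches=524288
            sleep 10
          done
After it rolls out, sysctl fs.inotify.max_user_watches on any node should report 524288, and if inotify exhaustion was the cause, the fluent-bit pods should stop crash-looping.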