google-cloud-platform · google-kubernetes-engine · fluent-bit

GKE Fluent bit partial logs


I have a K8s cluster in GCP (version 1.20.8-gke.900, from the regular update channel). All cluster pods write logs to STDOUT or STDERR from Docker containers.

A couple of weeks ago we found that some log entries are missing from the GCP logging console. I can see them via the kubectl tool, but it looks like they don't reach the logging bucket. For example, I can hit an API in the pod with an invalid payload to simulate an error in the logs; sometimes this error reaches the logging bucket, sometimes it doesn't. Super weird to me...

Traffic and resource utilization in the cluster are very low.

As I understand it, the fluent-bit DaemonSet is responsible for fetching logs from pods and passing them to the logging bucket. Current fluent-bit versions: gke.gcr.io/fluent-bit:v1.5.7-gke.1 and gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0.

I don't see any errors in the fluent-bit logs...

Could you please suggest what can be done to trace/debug/troubleshoot such a case?

Thanks!


Solution

  • It appears the issue is the log volume. The managed GKE logging agent is guaranteed at least 100 KiB/s of throughput per node, and performance can be higher depending on other node factors.

    If your workloads on a GKE node are generating significantly more than 100 KiB/s, then it's possible that some logs are not being collected due to the log volume.
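A quick way to check whether a pod is near that limit is to capture a window of its logs and compute the average throughput. A minimal sketch in Python; the `log_rate` and `exceeds_guarantee` helpers are hypothetical, not part of any GKE tooling:

```python
# Estimate log throughput from a captured window of pod logs and compare it
# against the ~100 KiB/s throughput the managed GKE logging agent guarantees.
# Helper names here are illustrative, not part of any GKE or fluent-bit API.

GUARANTEED_BYTES_PER_SEC = 100 * 1024  # 100 KiB/s

def log_rate(log_text: str, window_seconds: float) -> float:
    """Average log throughput in bytes/second over the capture window."""
    return len(log_text.encode("utf-8")) / window_seconds

def exceeds_guarantee(log_text: str, window_seconds: float) -> bool:
    """True if this window's log rate is above the guaranteed agent throughput."""
    return log_rate(log_text, window_seconds) > GUARANTEED_BYTES_PER_SEC

if __name__ == "__main__":
    import sys
    window = float(sys.argv[1]) if len(sys.argv) > 1 else 60.0
    text = sys.stdin.read()
    rate = log_rate(text, window)
    status = "OVER guarantee" if exceeds_guarantee(text, window) else "ok"
    print(f"{rate / 1024:.1f} KiB/s ({status})")
```

For example: `kubectl logs <pod> --since=60s | python3 lograte.py 60`. Keep in mind the guarantee applies per node, so you would sum the rates of all pods scheduled on the same node.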

    If you're generating more than 100 KiB/s, there are a few workarounds:

    1. Generate fewer logs.
    2. Leave the node in question partially idle. This lets fluent-bit pick up extra CPU cycles and process more logs.
    3. Run your own instance of fluent-bit with a higher resource allocation.
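    For option 3, the key difference from the managed agent is the container's resource allocation. A sketch of the relevant fragment of a self-managed fluent-bit DaemonSet spec (the values shown are illustrative assumptions, not recommendations, and a full DaemonSet also needs volume mounts for `/var/log` and a fluent-bit config):

    ```yaml
    # Fragment of a self-managed fluent-bit DaemonSet pod spec.
    # Resource values are illustrative; size them to your actual log volume.
    containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.5.7
        resources:
          requests:
            cpu: 500m        # more CPU than the managed agent's small allocation
            memory: 200Mi
          limits:
            cpu: "1"
            memory: 500Mi
    ```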

    The underlying root cause of the 100 KiB/s limitation is that only a small resource allocation is given to fluent-bit, so as to leave more resources available for your workloads.

    Refer to link for additional information.