In Kubernetes, services talk to each other via a Service IP. With iptables (or something similar) each TCP connection is transparently routed to one of the pods backing the called Service. If the calling service does not close the TCP connection (e.g. because it uses TCP keepalive or a connection pool), it stays connected to one pod and never uses the other pods that may be spawned.
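To make the problem concrete, here is a minimal Go sketch (the Service hostname is a placeholder) showing how a default, pooled HTTP client reuses one TCP connection, so the per-connection iptables DNAT decision is made once and every request hits the same pod:

    package main

    // Sketch of the problem: with keep-alive enabled, Go's default http.Client
    // reuses a single TCP connection to the Service IP, so kube-proxy's DNAT
    // decision happens exactly once and all requests land on the same pod.

    import (
    	"fmt"
    	"net/http"
    	"net/http/httptrace"
    )

    func main() {
    	client := &http.Client{} // default transport: keep-alive + connection pooling

    	for i := 0; i < 5; i++ {
    		req, _ := http.NewRequest("GET", "http://my-service.default.svc.cluster.local/", nil)

    		// Trace connection usage: after the first request the connection is
    		// reused, i.e. no new DNAT lookup and no chance to hit a different pod.
    		trace := &httptrace.ClientTrace{
    			GotConn: func(info httptrace.GotConnInfo) {
    				fmt.Printf("request %d: reused connection = %v (remote %s)\n",
    					i, info.Reused, info.Conn.RemoteAddr())
    			},
    		}
    		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    		resp, err := client.Do(req)
    		if err != nil {
    			fmt.Println("request failed:", err)
    			continue
    		}
    		resp.Body.Close()
    	}
    }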
What is the correct way to handle such a situation?
My own unsatisfying ideas:
I could close the connection after each request. But am I then making every call slower only to be able to distribute requests to different pods (see the sketch after these ideas)? Doesn't feel right.
I could force the caller to open multiple connections (assuming it would then distribute requests across them), but how many should it open? The caller has no idea (and probably shouldn't have any) how many pods there are.
I could limit the resources of the called service so it gets slow under concurrent requests and the caller would open more connections (hopefully to other pods). Again, I don't like the idea of arbitrarily slowing down requests, and this would only work for CPU-bound services.
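For reference, the first idea (closing the connection after every call) would look roughly like this in Go; the Service hostname is a placeholder, and the trade-off is a fresh TCP handshake per request:

    package main

    // Sketch of the "close the connection after every call" idea: disabling
    // keep-alive forces a new TCP connection (and therefore a new iptables DNAT
    // / pod choice) per request, at the cost of a handshake each time.

    import (
    	"fmt"
    	"net/http"
    )

    func main() {
    	client := &http.Client{
    		Transport: &http.Transport{
    			DisableKeepAlives: true, // every request dials a fresh connection
    		},
    	}

    	for i := 0; i < 5; i++ {
    		resp, err := client.Get("http://my-service.default.svc.cluster.local/")
    		if err != nil {
    			fmt.Println("request failed:", err)
    			continue
    		}
    		resp.Body.Close() // connection is torn down; the next Get may hit another pod
    	}
    }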
You can use a service mesh such as Istio or Linkerd2 to load-balance gRPC or keep-alive connections at the request level rather than the connection level.
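If a full mesh feels heavy, a rough client-side sketch of the same request-level idea for gRPC is shown below: grpc-go's DNS resolver combined with a headless Service (clusterIP: None, so DNS returns the individual pod IPs) and the round_robin policy spreads RPCs across pods even over a long-lived channel. The Service name and port are assumptions:

    package main

    // Client-side alternative to proxy-level balancing: resolve every pod IP
    // behind a *headless* Service via DNS and round-robin RPCs across them, so
    // a single long-lived gRPC channel no longer pins all traffic to one pod.

    import (
    	"log"

    	"google.golang.org/grpc"
    	"google.golang.org/grpc/credentials/insecure"
    )

    func main() {
    	conn, err := grpc.Dial(
    		// dns:/// makes the gRPC resolver query DNS and track all pod IPs.
    		"dns:///my-grpc-service.default.svc.cluster.local:50051",
    		grpc.WithTransportCredentials(insecure.NewCredentials()),
    		// Spread RPCs across the resolved addresses instead of using the first one.
    		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
    	)
    	if err != nil {
    		log.Fatalf("dial failed: %v", err)
    	}
    	defer conn.Close()

    	// Generated client stubs would use conn here; each RPC is balanced per call.
    }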