Search code examples
google-cloud-platformgpugoogle-kubernetes-enginebandwidth

Google Cloud Kubernates nodes bandwidth


I'm deploying my cluster on Google Cloud Kubernates service. It already has a few nodes. Also, I need the server with GPU from Google Cloud to make it work with my cluster. GPU instance continuously processes the incoming traffic (bandwidth should be up to 1Gb/s) and sends the results on cluster nodes (bandwidth should be even more than incoming bandwidth).

The most critical things for me in the project:
1) bandwidth between these nodes inside cluster;
2) bandwidth between the node and the GPU server;
3) bandwidth between the GPU server and the world;
4) bandwidth between the node and world.

The minimum appropriate bandwidth for each node is 1 Gb/s on downloading and uploading both. When I make speed tests, it shows download speed 100-680 Mb/s and upload speed 67-138 Mb/s for the same node on the same time (screenshots below were made in period 30 seconds between each other). So the current bandwidth is too small and unstable. But I need stable bandwidth starting from 1 Gb/s.

I tried to find any technical specification or pricing on bandwidth in Google Docs. But, there are only CPU/GPU/RAM/Disk, not bandwidth in the technical specification. And there is only traffic per month pricing on docs.

TL;DR:
How can I set stable 1 Gb/s or more bandwidth for each of the cluster nodes, GPU instance and any other Google Cloud virtual machine? Is there any service in Google Cloud that provides bandwidth of more than 1 Gb/s? Is there any solution/service in Google Cloud how to handle big Internet traffic?

The first speed test

The second speed test in 30 seconds

The third speed test in 60 seconds since the first

The fourth speed test in 90 seconds since the first

P.S. speed tests were made via:

npx speedo-cli

Solution

  • There's no guarantee really, especially when it comes to traffic to/from networks outside GCP. Here's a few things you can do to maximize bandwidth though:

    1. Increase the number of CPU cores per instance:

      caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. source

      • Note that the 2 Gbps per vCPU cap represents a theoretical limit using internal networks:

        The cap is a limit that can't be exceeded and doesn't indicate the actual throughput of your egress traffic. There is no guarantee that your traffic will achieve the maximum throughput, which depends on many factors other than the cap. source

    2. In case of traffic between VMs (i.e., cases 1 and 2 in your question) make sure the VMs are located in the same zone and you're using internal IPs:

      Any time you transfer data or communicate between VMs, you can achieve max performance by always using the internal IP to communicate. In many cases, the difference in speed can be drastic. source

      • Check this answer on how to measure network bandwidth between VMs using iperf.
    3. For advanced use cases you can try fine-tuning the TCP window size in your VMs.

    Finally, one benchmark observed that the GCP network throughput is 81x more variable when compared to AWS. Naturally this just reflects one benchmark but you might find it worthwhile to test other providers yourself.