Tags: sockets, azure-web-app-service, dotnet-httpclient, http2

What is the best way to avoid Azure App Service SNAT port exhaustion without a NAT gateway?


Some of our App Services running on .NET 6 are having intermittent connectivity issues. After working through the troubleshooting tool in the Azure portal, one specific instance (we have more than one instance due to scaling) of the App Service Plan is being capped at 128 SNAT ports, yet another instance can use 300 without problems.

How do I resolve the problem for this specific instance?

Furthermore, I understand that a NAT gateway can resolve the problem by providing more SNAT ports, but it incurs additional cost.

I would like to fix this with code changes. I have tried the common suggestions, such as making the HttpClient or even the HttpMessageHandler a singleton, but we still see hundreds of ports in use.
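This is roughly the pattern we already have (a minimal sketch with hypothetical names, not our actual registration code):

    using System.Net.Http;

    // One shared handler + client for the whole app, reused for every outbound call.
    public static class DownstreamHttp
    {
        private static readonly SocketsHttpHandler Handler = new();
        public static readonly HttpClient Client = new(Handler);
    }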

We suspect this is because our application talks to a lot of downstream applications that share the same load balancer (so the same IP), yet with many different custom domains. I would like to find a way to get all those requests to reuse ports if possible, or any other way to reduce the port usage.


Solution

  • Although this answer seems to be about .NET only, a similar approach can in fact be applied to other languages/runtimes.

    How to understand it better

    Yes, you need to understand it before changing anything; don't blindly tweak things without fully understanding what you are dealing with!

    The best troubleshooting guide starts from the Microsoft documentation: https://learn.microsoft.com/en-us/azure/app-service/troubleshoot-intermittent-outbound-connection-errors Within it, there is a link to this post, which I find is the best description of what SNAT is: https://4lowtherabbit.github.io/blogs/2019/10/SNAT/. Although written in 2019, it doesn't seem very dated; it mentions the old and the new port allocation algorithm, where the old one gives 160 ports and the new one preallocates 128.

    The troubleshooting guide does say you get 128 preallocated ports, and beyond that you may run into issues, which is another way of saying that only 128 are guaranteed.

    A high-level summary

    Basically, an App Service Plan uses an Azure Load Balancer for outbound network requests, and the algorithm for sharing the SNAT ports is the same; the details are listed here: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-outbound-connections It says that with a backend pool size of 201-400 machines, the load balancer allocates 128 SNAT ports per instance.

    This probably means that Azure tries to utilise its App Service stamps to the point of nearly 400 plans sharing the same load balancer. So when you get unlucky and Azure creates an App Service Plan instance on a busy 'box', you may not be able to use more than 128 SNAT ports; however, if you are lucky and land on a less busy 'box', your app can easily chew through more than 128. For us it was 300-ish with no issue other than the troubleshooting tool reporting SNAT port exhaustion detected, with very few failures.

    A short-term workaround if you are trying to put out a fire

    The short-term workaround could be to keep scaling your App Service out and in until Azure destroys the unlucky instance you got; then you would be out of the water temporarily. Or you might get even unluckier instances and everything gets capped at 128.

    Long-term code-level fix without a NAT gateway

    Firstly, fixing things at the code level may not always be possible, but you can always analyse the application's behaviour to see how connections are used. And you do not necessarily need to check this in the cloud; TCPView (https://learn.microsoft.com/en-us/sysinternals/downloads/tcpview) is a good tool to help you understand socket usage locally. It is not exactly the same thing, but if you manage to reduce socket usage, you will in turn reduce SNAT port usage.

    Trick one: Tweaking the connection pool

    The HttpClient itself actually matters less, because all connection pooling is done inside the HttpMessageHandler/HttpClientHandler. Furthermore, .NET Framework and .NET Core handle the HTTP connection pool differently. A very good article explaining this in detail can be found here: https://www.stevejgordon.co.uk/httpclient-connection-pooling-in-dotnet-core

    Either way, you have some options to control the connection pooling by changing settings such as PooledConnectionLifetime, PooledConnectionIdleTimeout and MaxConnectionsPerServer. Be very careful when changing MaxConnectionsPerServer, as it can make quite a difference by choking your application's throughput. Worth mentioning: the default value in .NET Framework is 2, and in Core it is unlimited.

    I personally found PooledConnectionIdleTimeout the most useful and the least risky to change. The idea is to reuse HTTP connections instead of re-establishing them where you can, but whatever the default value is (again, mind the difference between the Full Framework and Core), it was not chosen with 128 SNAT ports in mind; it was chosen for an OS where you have 65,536 ports available. So when your available pool has shrunk to below 2% of that (128 / 65536), it doesn't seem like a terrible idea to lower this setting. 2% of the default could be too aggressive; check your traffic with a good observability tool, look at the outgoing traffic and figure out a value (in our case, I picked 5 seconds).
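    As a minimal sketch of where these knobs live in .NET Core / .NET 5+ (the concrete values and the placeholder host are assumptions based on our own traffic, not recommendations):

        using System;
        using System.Net.Http;

        // Minimal sketch: one shared handler + client with the pool tuned so idle
        // connections release their SNAT port quickly. Measure your own traffic
        // before copying any of these values.
        var handler = new SocketsHttpHandler
        {
            // Close idle connections after 5 seconds so they stop holding a SNAT port.
            PooledConnectionIdleTimeout = TimeSpan.FromSeconds(5),

            // Recycle long-lived connections periodically so DNS changes are honoured.
            PooledConnectionLifetime = TimeSpan.FromMinutes(10),

            // Default in .NET Core is unlimited; only cap it if you have measured
            // that the cap will not choke your throughput.
            // MaxConnectionsPerServer = 50
        };

        var client = new HttpClient(handler);

        // "downstream.example.com" is a placeholder for one of your downstream services.
        var response = await client.GetAsync("https://downstream.example.com/health");
        Console.WriteLine(response.StatusCode);

    As with the singleton advice above, create this handler/client once and reuse it; recreating it per request defeats the pooling entirely.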

    I have only checked the open-source dotnet runtime; the code managing the connection pool has a sealed class that uses the host name as part of the connection pool key. So even if your downstream services are in fact behind the same load balancer with the same IP, there is no way, short of changing the runtime, to make them share the same pool. I have not checked the Full Framework implementation, but I would imagine it is very similar.

    Trick two: Using HTTP/2 by default if the downstream supports it

    I saved this for last, but it is the most powerful change I have ever seen. When googling around, I found no information whatsoever on the internet connecting HTTP/2 and SNAT ports, which is the main reason I chose to type all of this out to answer my own question here, in the hope of helping folks wondering about the same thing in the future.

    Use HTTP/2! Seriously, use it and check the result!

    It took me some time to connect the dots because no one on the internet mentions this clearly. But if you look closer, and skip all the pep talk about how great HTTP/2 is at binary framing, header compression, server push and all that (not saying those are not great), the biggest advantage in our context is that with HTTP/2 you can send concurrent requests to the server over a single TCP connection. There is no longer a need to create a new connection just because something else is using the port; as long as it is the same server, your request can be multiplexed onto the existing connection and reuse the socket/SNAT port.

    For the Full Framework you can do this too, but I have not tried it; some documentation from MS: https://learn.microsoft.com/en-us/dotnet/api/system.net.http.httpclient?view=netframework-4.8

    For .NET Core, it is as easy as setting a property on the HttpClient or on specific request messages; an example article is here: https://www.siakabaro.com/use-http-2-with-httpclient-in-net-6-0/
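    For example, a minimal .NET 6 sketch (the placeholder URL is an assumption; check response.Version to confirm HTTP/2 was actually negotiated):

        using System;
        using System.Net;
        using System.Net.Http;

        // Minimal sketch for .NET 6: ask for HTTP/2 by default so concurrent requests
        // to the same host are multiplexed over a single TCP connection (one SNAT port).
        // The URL below is a placeholder for a downstream service.
        var client = new HttpClient
        {
            DefaultRequestVersion = HttpVersion.Version20,
            // Fall back gracefully if the downstream does not support HTTP/2.
            DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrLower
        };

        var response = await client.GetAsync("https://downstream.example.com/api/items");
        Console.WriteLine($"Negotiated {response.Version}: {(int)response.StatusCode}");

        // The same can be done per request instead of per client:
        var request = new HttpRequestMessage(HttpMethod.Get, "https://downstream.example.com/api/items")
        {
            Version = HttpVersion.Version20,
            VersionPolicy = HttpVersionPolicy.RequestVersionOrLower
        };
        response = await client.SendAsync(request);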

    In our specific case, SNAT port usage dropped from 300 to around 30 (on top of me tweaking the idle timeout to 5 seconds). We have 60-ish domains pointing at that load balancer, but only 30-ish carry heavy traffic, so as a result the app barely uses more than 30 SNAT ports. I also simulated extreme cases: in fact, the busier your traffic, the greater the improvement you will observe, because no matter how horrible it was over HTTP/1.1 (I pushed it to more than 10k SNAT port requests which all failed), it can easily shrink down to 1 port per domain.

    Something worth mentioning, but which I have not tested: we have multiple (5) App Services running in the single service plan, so an instance hosts many App Services. When you consider 60-ish domains in each of the 5, you get 300-ish SNAT ports. I suspect that once opted in to HTTP/2, TCP connections might then be shared between those App Services on the same instance, hence the usage could really come down to 30. But I have not validated this, so don't take my word for it.

    Lastly, the great thing about this is that if SNAT ports are your bottleneck for how many App Services you can put into a service plan, after the HTTP/2 code change you are likely able to fit a few more in, which is a wonderful cost-saving trick!

    I hope this helps someone. If you have gone through all of this, you should now realise that these approaches do not just apply to the .NET world; anything you use that supports the same ideas of tuning pool behaviour and opting in to HTTP/2 will benefit. Thanks for reading to the end!