I have a need for one ASP.NET Core server to download via the backend a massive number of files from another server using GET requests. (The platform doesn't matter, assume Dropbox, OneDrive, or anything else with API access).
Doing this serially is much too slow, so I have used Parallel.ForEachAsync to iterate through the list of file IDs to be retrieved and GET them in parallel using a single HttpClient instance. The approach works so far in a development environment, but I have some concerns about whether this method is appropriate for production.
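Here is a simplified sketch of what the code looks like (BuildDownloadUri and SaveAsync stand in for the real URL construction and storage logic):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class FileDownloader
{
    // Single shared HttpClient instance, as described above.
    private static readonly HttpClient Client = new HttpClient();

    public async Task DownloadAllAsync(IEnumerable<string> fileIds, CancellationToken ct)
    {
        // Iterate the file IDs and GET them in parallel.
        await Parallel.ForEachAsync(fileIds, ct, async (fileId, token) =>
        {
            using var response = await Client.GetAsync(BuildDownloadUri(fileId), token);
            response.EnsureSuccessStatusCode();

            await using var content = await response.Content.ReadAsStreamAsync(token);
            await SaveAsync(fileId, content, token);
        });
    }

    // Placeholders for the real URL scheme and storage logic.
    private Uri BuildDownloadUri(string fileId) => new Uri($"https://files.example.com/{fileId}");
    private Task SaveAsync(string fileId, Stream content, CancellationToken ct) => Task.CompletedTask;
}
```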
Here's the question: What potential pitfalls could arise from this method, and what should be done to mitigate them (or what approach should be used instead, if this method is fatally flawed)?
To elaborate further, here are some of my concerns:

- Do I need to limit the amount of parallelism via ParallelOptions.MaxDegreeOfParallelism, and if so what should that value be? I realize HTTP standards suggest no more than 2 concurrent connections to the same host, however modern browsers appear to use 6 or more sometimes. Obviously imposing no limit results in the fastest transfers, but is this safe to do?
- Do I run a risk of running out of outbound TCP/IP ports to the remote host if too many users are doing the same thing at the same time? Will HttpClient wait for ports to free up or just fail? Do I need to handle such failures with a retry?
- Do I run a risk of impacting the remote server negatively if I don't limit the concurrent requests? Could my application be mistaken for a DDOS attack? Would this risk be mitigated if I limit to 6 like a browser does?
- Should I expect (and handle) potential HTTP 429 responses?

PS - I realize this question is very similar to this one, but unfortunately that question predated .NET 6 and Parallel.ForEachAsync, so the answers largely focused on the pitfalls of using the blocking Parallel.ForEach rather than the networking issues.
Do I need to limit the amount of parallelism
Yes.
what should that value be?
Difficult to tell, and it will likely depend on the backend. Somewhere in the range of 2-6 might be a good place to start, adjusting up or down as required.
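For example, reusing the placeholder names from the sketch in the question (Client, BuildDownloadUri, SaveAsync), capping the concurrency is just a matter of passing a ParallelOptions instance; the value 4 below is only an illustrative starting point in that range, ideally read from configuration so it can be tuned without a redeploy:

```csharp
public async Task DownloadAllAsync(IEnumerable<string> fileIds, CancellationToken ct)
{
    // 4 is only an illustrative starting point in the 2-6 range; make it configurable.
    var options = new ParallelOptions
    {
        MaxDegreeOfParallelism = 4,
        CancellationToken = ct
    };

    await Parallel.ForEachAsync(fileIds, options, async (fileId, token) =>
    {
        using var response = await Client.GetAsync(BuildDownloadUri(fileId), token);
        response.EnsureSuccessStatusCode();
        await using var content = await response.Content.ReadAsStreamAsync(token);
        await SaveAsync(fileId, content, token);
    });
}
```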
Do I run a risk of running out of outbound TCP/IP ports to the remote host if too many users are doing the same thing at the same time?
According to this, the server should never run out of ports if HTTP is used. The client (i.e. your server) could run out of ports, but you will likely run out of other resources (bandwidth, threads, memory, etc.) before running out of ports.
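If you also want a hard ceiling on outbound connections regardless of how the downloads are scheduled, SocketsHttpHandler can cap the connections per host; with that cap in place, additional requests wait inside HttpClient for a free connection rather than failing. As a sketch, the Client field from the earlier snippet could be constructed like this (the numbers are purely illustrative):

```csharp
// Shared HttpClient whose handler caps connections (and therefore outbound ports)
// used against any single host. The numbers are illustrative, not recommendations.
private static readonly HttpClient Client = new HttpClient(new SocketsHttpHandler
{
    MaxConnectionsPerServer = 6,                        // extra requests queue for a free connection
    PooledConnectionLifetime = TimeSpan.FromMinutes(2)  // recycle connections so DNS changes are noticed
});
```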
Do I need to handle such failures with a retry?
You will need error handling. Retrying a few times before escalating the error is a very typical solution.
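A minimal hand-rolled sketch of such a retry is below; maxAttempts and the backoff delays are illustrative, and libraries such as Polly offer the same thing with more options:

```csharp
// Retry a GET a few times with exponential backoff before escalating the error.
static async Task<HttpResponseMessage> GetWithRetryAsync(
    HttpClient client, Uri uri, CancellationToken ct, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            var response = await client.GetAsync(uri, ct);
            if ((int)response.StatusCode < 500)
                return response; // success, or a non-5xx status the caller should inspect (e.g. 429, see below)

            response.Dispose(); // 5xx: treat as transient and retry below
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // transient network failure; fall through to the backoff delay
        }

        if (attempt >= maxAttempts)
            throw new HttpRequestException($"Request to {uri} failed after {maxAttempts} attempts.");

        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)), ct); // 2s, 4s, 8s, ...
    }
}
```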
Do I run a risk of impacting the remote server negatively if I don't limit the concurrent requests? Could my application be mistaken for a DDOS attack? Would this risk be mitigated if I limit to 6 like a browser does?
Yes and yes. The risk will be lower with fewer requests, but would not be eliminated even if you are only doing sequential requests. It will depend on the specific service. The more you pay for the service, the higher the limits will likely be. Limits will likely not be documented, so it might be advisable to start slow.
Should I expect (and handle) potential HTTP 429 responses?
It depends on your goals. You could perhaps make some dynamic system that reduces request frequency depending on the response from the server. Or you could just find a static limit that works well enough.
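If you do want to react to 429s, a minimal sketch is to honor the Retry-After header when the server provides one; the WaitIfThrottledAsync helper name and the 30-second fallback below are illustrative, not a standard API:

```csharp
// Check for 429 and wait out the server-suggested delay before the caller retries.
static async Task<bool> WaitIfThrottledAsync(HttpResponseMessage response, CancellationToken ct)
{
    if (response.StatusCode != HttpStatusCode.TooManyRequests)
        return false; // not throttled, the caller can use the response as-is

    var delay = response.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(30);
    await Task.Delay(delay, ct);
    return true; // the caller should re-issue the request after this wait
}
```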
download via the backend a massive number of files from another server using GET requests. (The platform doesn't matter, assume Dropbox, OneDrive, or anything else with API access)
Keep in mind that the terms of service might be incompatible with whatever you are trying to do. And the service provider may reserve the right to terminate service at their discretion.
Overall, you will likely have fewer issues if you use fewer, larger files rather than many small files.