Tags: c#, asynchronous, parallel-processing, task, webclient

Efficient way to download a huge load of files in parallel


I'm trying to download a huge load of files (pictures) from the internet. I'm struggling with async/parallel, because

a) I can't tell in advance whether there is a file or not. I just got a million links, each pointing either to a single picture (300 KB to 3 MB) or to a 404 "page does not exist". So to avoid downloading a 0-byte file, I request the same page twice: once to check for a 404 and after that for the picture. The other way would be downloading all the 0-byte files and deleting millions of them afterwards, which keeps Windows 10 stuck on this task until I reboot.

b) While the (very slow) download is in progress, whenever I have a look at any of the "successfully downloaded" files, it has been created with 0 bytes and doesn't contain the picture. What do I need to change to really download each file before moving on to the next one?

How do I fix both of these issues? Is there a better way to download thousands or millions of files (compression/creating a .zip on the server is not possible)?

           //loopResult = Parallel.ForEach(_downloadLinkList, new ParallelOptions { MaxDegreeOfParallelism = 10 }, DownloadFilesParallel);    
            private async void DownloadFilesParallel(string path)
            {
                string downloadToDirectory = ""; 
                string x = ""; //in case this request fails, I get a 404 from the web server and therefore no download is needed
                System.Threading.Interlocked.Increment(ref downloadCount);
                OnNewListEntry(downloadCount.ToString() + " / " + linkCount.ToString() + " heruntergeladen"); //tell my GUI to update
                try
                {
                    using(WebClient webClient = new WebClient())
                    {
                        downloadToDirectory = Path.Combine(savePathLocalComputer, Path.GetFileName(path)); //path on local computer

                        webClient.Credentials = CredentialCache.DefaultNetworkCredentials;
                        x = await webClient.DownloadStringTaskAsync(new Uri(path)); //if this throws an exception, ignore this link
                        Directory.CreateDirectory(Path.GetDirectoryName(downloadToDirectory)); //if the request is successful, create the folder on the local PC if needed
                        await webClient.DownloadFileTaskAsync(new Uri(path), @downloadToDirectory); //should download the file and release one parallel slot for the next file; instead there is a 0-byte file and the next one gets downloaded
                    }
                }
                catch(WebException wex)
                {
                }
                catch(Exception ex)
                {
                    System.Diagnostics.Debug.WriteLine(ex.Message);
                }
                finally
                {
                    
                }
            }

[screenshot attached; the picture is SFW, the link is NSFW]


Solution

  • Here's an example using HttpClient with a limit on the maximum number of concurrent downloads.

    private static readonly HttpClient client = new HttpClient();
    
    private async Task DownloadAndSaveFileAsync(string path, SemaphoreSlim semaphore, IProgress<int> status)
    {
        try
        {
            status?.Report(semaphore.CurrentCount);
            using (HttpResponseMessage response = await client.GetAsync(path, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false))
            {
                if (response.IsSuccessStatusCode) // ignore non-success responses (e.g. 404)
                {
                    string filePath = Path.Combine(savePathLocalComputer, Path.GetFileName(path));
                    string dir = Path.GetDirectoryName(filePath);
                    Directory.CreateDirectory(dir);
                    using (Stream responseStream = await response.Content.ReadAsStreamAsync().ConfigureAwait(false))
                    using (FileStream fileStream = File.Create(filePath))
                    {
                        await responseStream.CopyToAsync(fileStream).ConfigureAwait(false);
                    }
                }
            }
        }
        finally
        {
            semaphore.Release();
        }
    }
    

    The concurrency part:

    client.BaseAddress = new Uri("http://somesite"); // BaseAddress is a Uri, not a string
    int downloadCount = 0;
    List<string> pathList = new List<string>();
    // fill the list here
    
    List<Task> tasks = new List<Task>();
    int maxConcurrentTasks = Environment.ProcessorCount * 2; // 16 for me
    
    IProgress<int> status = new Progress<int>(availableTasks =>
    {
        downloadCount++;
        OnNewListEntry(downloadCount + " / " + pathList.Count + " heruntergeladen\r\nRunning " + (maxConcurrentTasks - availableTasks) + " downloads.");
    });
    
    using (SemaphoreSlim semaphore = new SemaphoreSlim(maxConcurrentTasks))
    {
        foreach (string path in pathList)
        {
            await semaphore.WaitAsync();
            tasks.Add(DownloadAndSaveFileAsync(path, semaphore, status));
        }
        try
        {
            await Task.WhenAll(tasks);
        }
        catch (Exception ex)
        {
            // handle the Exception here
        }
    }
    

    Progress<T> here simply executes the callback on the UI thread (it posts to the SynchronizationContext captured when it was constructed), so Interlocked is not needed inside the callback and it's safe to update the UI.
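
    Because Progress<T> posts back to the context it was created on, construct it on the UI thread. Below is a minimal sketch of how the driver code above could be hosted, assuming a WinForms/WPF-style form; btnDownload_Click, RunDownloadsAsync and lblStatus are hypothetical names, not part of the code above:

    // Hypothetical event handler (assumption: a form with a download button and a status label;
    // RunDownloadsAsync is an async Task method containing the semaphore/Task.WhenAll loop above).
    private async void btnDownload_Click(object sender, EventArgs e)
    {
        // Progress<T> captures the current SynchronizationContext, so constructing it
        // here on the UI thread means its callback also runs on the UI thread.
        IProgress<int> status = new Progress<int>(availableTasks =>
            lblStatus.Text = "Free download slots: " + availableTasks);

        await RunDownloadsAsync(status); // async Task, not async void
        lblStatus.Text = "All downloads finished."; // runs after every file has been written
    }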

    On .NET Framework, you may add this line to the app startup code to make it faster (on .NET Core it has no effect and isn't needed):

    ServicePointManager.DefaultConnectionLimit = 10;
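
    For example (a sketch, assuming a classic .NET Framework WinForms entry point; MainForm is a placeholder name):

    // Raise the per-host connection limit before the first request is made.
    [STAThread]
    static void Main()
    {
        ServicePointManager.DefaultConnectionLimit = 10; // the .NET Framework default is 2 connections per host
        Application.EnableVisualStyles();
        Application.Run(new MainForm());
    }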