c# azure azure-data-lake-gen2 azure-sdk

Using C# Azure.Storage.Files.DataLake to download datalake blobs (ideally in parallel)


I'd like to figure out how to download the files from an ADLS Gen2 storage directory. I only have a SAS URL for that directory, and I would like to recursively download all the files in it, preferably in parallel.

It is clear how to do this given the storage account credentials, and there are many examples showing how, but I couldn't find any that use a SAS URL.

Any clues or documentation links would be much appreciated! The code below is what works for me now, but whenever I change it to ReadToAsync, or try the downloads with Parallel.ForEach, Parallel.ForEachAsync, or a semaphore, the call to Read/ReadAsync crashes. Is there a better way to do this? Should I abandon the library and make web requests to the REST API instead?

DataLakeDirectoryClient directoryClient = new DataLakeDirectoryClient(_containerSasUri);
if (directoryClient.Exists())
{
    foreach (var blob in directoryClient.GetPaths(true))
    {
        if (blob.IsDirectory.HasValue && !blob.IsDirectory.Value)
        {
            // blobClient is assumed to be a DataLakeFileClient for blob.Name; its creation is not shown here.
            blobClient.ReadTo(Path.Combine(downloadPath, blob.Name),
                new DataLakeFileReadToOptions() { TransferOptions = new() { MaximumConcurrency = 10 } });
        }
    }
}

Solution

    I reproduced this in my environment and got the expected results, as shown below:

    Inside ADLS Account: (screenshot not reproduced)

    Code:

    using Azure.Storage.Files.DataLake;
    using System;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    Console.WriteLine("************");
    Console.WriteLine("************");
    Console.WriteLine("Started downloading in parallel");
    // SAS URL of the container (sr=c).
    Uri conUri = new Uri("https://rithwik987.blob.core.windows.net/rithwik?sp=racwdlmeop&st=2023-06-28T05:12:24Z&se=2023-06-28T13:12:24Z&sv=2022-11-02&sr=c&0%3D");
    string downPath = @"C:\Users\Desktop\Files";
    DataLakeDirectoryClient dc = new DataLakeDirectoryClient(conUri);
    // List all paths recursively and keep only files (IsDirectory is a bool?).
    var files = dc.GetPaths(recursive: true).Where(b => b.IsDirectory != true).ToList();

    Parallel.ForEach(files, b =>
    {
        DataLakeFileClient fc = dc.GetFileClient(b.Name);
        string destFilePath = Path.Combine(downPath, b.Name);
        // Create the local folder first so files in nested directories can be written.
        Directory.CreateDirectory(Path.GetDirectoryName(destFilePath));
        Console.WriteLine("Downloading " + b.Name);
        // File.Create truncates an existing file; File.OpenWrite would keep stale trailing bytes.
        using FileStream downloadStream = File.Create(destFilePath);
        fc.ReadTo(downloadStream);
    });
    Console.WriteLine("Downloading in parallel completed");
    Console.WriteLine("************");
    Console.WriteLine("************");


    Output: (screenshots not reproduced)
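
Since the question also asks about ReadToAsync and Parallel.ForEachAsync, here is a minimal async sketch of the same idea. It is not the code I ran above, and it rests on a few assumptions: .NET 6 or later, a container-level SAS URL (sr=c) like the one in my test, and that PathItem.Name is the path relative to the file system root, which is why the file clients are created from a DataLakeFileSystemClient. The environment variable names ADLS_SAS_URL, ADLS_DIRECTORY and ADLS_DOWNLOAD_ROOT and the helpers ToLocalPath and DownloadAllAsync are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage;
using Azure.Storage.Files.DataLake;
using Azure.Storage.Files.DataLake.Models;

// Hypothetical configuration: the environment variable names are assumptions for this sketch.
var sasUrl = Environment.GetEnvironmentVariable("ADLS_SAS_URL");
var directory = Environment.GetEnvironmentVariable("ADLS_DIRECTORY") ?? "";
var root = Environment.GetEnvironmentVariable("ADLS_DOWNLOAD_ROOT") ?? "downloads";

if (string.IsNullOrEmpty(sasUrl))
    Console.WriteLine("Set ADLS_SAS_URL to a container-level SAS URL to run the download.");
else
    await DownloadAllAsync(new Uri(sasUrl), directory, root, maxParallel: 4);

// Maps a remote path such as "dir/sub/file.txt" to a path under the local root.
string ToLocalPath(string localRoot, string remotePath) =>
    Path.Combine(localRoot, remotePath.Replace('/', Path.DirectorySeparatorChar));

async Task DownloadAllAsync(Uri fileSystemSasUri, string directoryPath, string downloadRoot, int maxParallel)
{
    var fileSystem = new DataLakeFileSystemClient(fileSystemSasUri);

    // List everything below directoryPath recursively and keep only files.
    var filePaths = new List<string>();
    await foreach (PathItem item in fileSystem.GetPathsAsync(directoryPath, recursive: true))
    {
        if (item.IsDirectory != true)
            filePaths.Add(item.Name);
    }

    // Download at most maxParallel files at the same time.
    var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
    await Parallel.ForEachAsync(filePaths, parallelOptions, async (remotePath, ct) =>
    {
        string localPath = ToLocalPath(downloadRoot, remotePath);
        Directory.CreateDirectory(Path.GetDirectoryName(localPath) ?? downloadRoot);

        DataLakeFileClient file = fileSystem.GetFileClient(remotePath);
        await using FileStream destination = File.Create(localPath);

        // TransferOptions controls how many ranges of a single file are fetched concurrently.
        var readOptions = new DataLakeFileReadToOptions
        {
            TransferOptions = new StorageTransferOptions { MaximumConcurrency = 4 }
        };
        await file.ReadToAsync(destination, readOptions, ct);
    });
}
```

Total concurrency is roughly MaxDegreeOfParallelism multiplied by MaximumConcurrency, so it is probably worth starting with small values for both and raising them only if the downloads stay stable. On frameworks older than .NET 6, a SemaphoreSlim combined with Task.WhenAll could play the same throttling role.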