
Azure Blobs C# client - using multiple filters server side


I'm trying to load blob names for filtering in my program; after applying all filters I plan to download and process each blob. We currently have around 30k blobs in storage, stored inside a container under paths like year/month/day/hour/file.csv (or file.json for unprocessed files).

My program needs to accept a start and end date dynamically (a range of at most 30 days) for downloading. The GetBlobs method of Azure.Storage.Blobs.BlobContainerClient lets me pass a single string prefix for server-side filtering.

If my dates are 2020/06/01 and 2020/06/02, the program is very fast: it takes around 2 seconds to get the blobs and apply the rest of the filters. However, if I have 2020/05/30 and 2020/06/01, I can't use a month prefix because GetBlobs accepts only one string, so my prefix is just 2020, which takes around 15 seconds to complete. The rest of the filtering is done locally, but the biggest delay is the GetBlobs() call.

Is there any other way to use multiple filters server side from a .NET Core app?

Here are relevant functions:

        BlobContainerClient container = new BlobContainerClient(resourceGroup.Blob, resourceGroup.BlobContainer);
        var blobs = container.GetBlobs(prefix : CreateBlobPrefix(start, end))
            .Select(item=> item.Name)
            .ToList();
        blobs = FilterBlobList(blobs, filter, start, end);

    // Builds the longest "year/month/day" prefix shared by both dates.
    // Returns null (no server-side filtering) when the years differ.
    private string CreateBlobPrefix(DateTime start, DateTime end)
    {
        if (start.Year != end.Year)
            return null;

        string prefix = start.Year.ToString();
        if (start.Month == end.Month)
        {
            // "D2" zero-pads to two digits, e.g. 6 -> "06".
            prefix += "/" + start.Month.ToString("D2");
            if (start.Day == end.Day)
                prefix += "/" + start.Day.ToString("D2");
        }
        return prefix;
    }

EDIT: Here's how I did it in the end. Because it's faster to make multiple requests with more specific prefixes, I did the following:

  • create a list of prefixes, one per date in the selected time window (the window comes from a UI application where the user can enter any range)
  • for each prefix, send a request to Azure to get the matching blobs
  • concatenate all blob names into one list
  • process the list by using a blob client for each blob name

Here's the code:

        foreach (var blobPrefix in CreateBlobPrefix(start, end))
        {
            var currentList = container.GetBlobs(prefix: blobPrefix)
                .Select(item => item.Name)
                .ToList();
            blobs = blobs.Concat(currentList).ToList();
        }
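
The loop above relies on a version of CreateBlobPrefix that returns several prefixes, which isn't shown. As a minimal sketch of what such a helper might look like, assuming one prefix per calendar day in the year/month/day layout described earlier (the class and method names BlobPrefixes and CreateDailyPrefixes are hypothetical, not the original implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class BlobPrefixes
{
    // Hypothetical helper: returns one "yyyy/MM/dd" prefix for each
    // calendar day in the inclusive range [start, end].
    public static List<string> CreateDailyPrefixes(DateTime start, DateTime end)
    {
        if (end.Date < start.Date)
            throw new ArgumentException("end must not be earlier than start", nameof(end));

        int days = (end.Date - start.Date).Days + 1;

        // In a custom format string "/" means "the culture's date separator",
        // so InvariantCulture is passed to guarantee a literal slash.
        return Enumerable.Range(0, days)
            .Select(offset => start.Date.AddDays(offset)
                .ToString("yyyy/MM/dd", CultureInfo.InvariantCulture))
            .ToList();
    }
}
```

For 2020/05/30 to 2020/06/01 this yields the three prefixes 2020/05/30, 2020/05/31 and 2020/06/01, which could be fed to the loop above in place of CreateBlobPrefix(start, end). With a 30-day window that means at most 30 listing requests, each scoped to a single day.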

Solution

  • You could filter more than once, finding the common denominator between the dates:

    First filter with the string prefix by the start month and year, 2020/05, and then filter locally for exact date.

    Then you can gradually increase the day/month filter until you reach the end of the range.

    The granularity of your stepping really depends on the time it takes to make a call to Azure for a given average number of results. Another advantage is you could run these sub-queries in parallel.

    I've used this code:

        var prefixDateFilters = Enumerable.Range(0, 1 + endDateInclusive.Subtract(startDateInclusive).Days)
                                          .Select(offset => startDateInclusive.AddDays(offset))
                                          .Select(date => $"{date.ToString(BlobFileDateTimeFormat)}").ToList();
    
        prefixDateFilters.AsParallel()
                     .Select(filter => containerClient.GetBlobs(prefix: filter))