Tags: c#, async-await, parallel-processing, ironpdf

How to speed up IronPdf when using async/await


I'm trying to make a piece of code run faster. The code is already using async/await. But it's still slow.

So I tried to alter my foreach to use the new IAsyncEnumerable. However, I gained no performance from this, and the code appears to run sequentially, which surprised me. I thought await foreach would run each iteration on its own thread.

Here's my attempt at speeding up the code.

var bag = new ConcurrentBag<IronPdf.PdfDocument>(); // probably don't need a ConcurrentBag
var foos = _dbContext.Foos;
await foreach (var fooPdf in GetImagePdfs(foos))
{
    bag.Add(fooPdf);
}

private async IAsyncEnumerable<IronPdf.PdfDocument> GetImagePdfs(IEnumerable<Foo> foos)
{
    foreach (var foo in foos)
    {
        var imagePdf = await GetImagePdf(foo);

        yield return imagePdf;
    }
}

private async Task<IronPdf.PdfDocument> GetImagePdf(Foo foo)
{
    using var imageStream = await _httpService.DownloadAsync(foo.Id);
    var imagePdf = await _pdfService.ImageToPdfAsync(imageStream);

    return imagePdf;
}

using IronPdf;
public class PdfService
{
    // this method is quite slow
    public async Task<PdfDocument> ImageToPdfAsync(Stream imageStream)
    {
        var imageDataURL = Util.ImageToDataUri(Image.FromStream(imageStream));
        var html = $@"<img style=""max-width: 100%; max-height: 70%;"" src=""{imageDataURL}"">";
        using var renderer = new HtmlToPdf(new PdfPrintOptions()
        {
            PaperSize = PdfPrintOptions.PdfPaperSize.A4,
        });
        return await renderer.RenderHtmlAsPdfAsync(html);
    }
}

I also gave Parallel.ForEach a try

Parallel.ForEach(foos, async foo =>
{
    var imagePdf = await GetImagePdf(foo);
    bag.Add(imagePdf);
});

However I keep reading that I shouldn't use async with it, so not sure what to do. Also the IronPdf library crashes when doing it that way.


Solution

  • The problem with your foreach and await foreach approaches is that they execute sequentially, even though they use the async and await pattern. await does exactly what its name says: it waits for the current operation to finish before the next iteration starts.

    Regarding Parallel.ForEach, your suspicions are correct: it is not suitable for async methods and IO-bound workloads. Parallel.ForEach takes an Action delegate, and passing an async lambda to an Action creates an async void method. As a consequence, each task runs unobserved: the loop returns before the work has finished, and the caller cannot await the results or catch their exceptions.
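
To illustrate the async void behaviour described above, here is a minimal sketch. The class name, the item count, and the Task.Delay call (a stand-in for the download and render steps) are assumptions made for the demonstration and are not part of the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelForEachPitfall
{
    // Returns how many items had finished by the time Parallel.ForEach returned.
    public static int CountCompletedWhenLoopReturns()
    {
        var done = new ConcurrentBag<int>();

        // The async lambda is converted to Action<int>, which makes it async void.
        // Parallel.ForEach only runs each lambda up to its first incomplete await.
        Parallel.ForEach(Enumerable.Range(0, 5), async i =>
        {
            await Task.Delay(500); // hypothetical stand-in for the real IO work
            done.Add(i);
        });

        // The loop has returned, but the continuations are most likely still pending.
        return done.Count;
    }

    public static void Main()
    {
        Console.WriteLine(
            $"Completed when loop returned: {CountCompletedWhenLoopReturns()} of 5");
    }
}
```

The reported count should normally be lower than 5 (typically 0), because nothing waits for the delayed continuations.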

    There are many approaches to take from here, but the simplest is to start each task hot, project them into a collection, and await them all to completion. This lets the IO-bound work offload (term used loosely) to an IO Completion Port, so any thread involved can go back to the thread pool and be reused efficiently by the Task Scheduler until the IO work completes.

    Assuming there are no shared resources, just project the started tasks to an IEnumerable<Task<PdfDocument>> and use Task.WhenAll, which is documented as follows:

    "Creates a task that will complete when all of the supplied tasks have completed."

    // Call the single-item method GetImagePdf (not GetImagePdfs) once per Foo.
    var tasks = _dbContext.Foos.Select(x => GetImagePdf(x));
    var results = await Task.WhenAll(tasks);
    

    In the above scenario, Select is lazy, so the tasks are created when Task.WhenAll enumerates the sequence. Each call to the async method GetImagePdf returns a Task that is already started (hot), and the Task Scheduler takes care of scheduling any threads that are needed from the thread pool. As soon as any code awaits an IO job, a callback is registered with the operating system and the thread goes back to the pool to be reused, and so on. Task.WhenAll waits for all the tasks to complete or fault, then returns an array containing each result; if any task faulted, awaiting the combined task throws instead.
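
As a rough, self-contained comparison of the two approaches, the sketch below replaces the real download and render steps with a hypothetical FakeGetImagePdfAsync that only waits 200 ms. The class name, method names, and timings are assumptions for illustration; actual gains with IronPdf depend on how much of the work is really IO-bound.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class WhenAllDemo
{
    // Hypothetical stand-in for GetImagePdf: pretends to do 200 ms of IO.
    private static async Task<string> FakeGetImagePdfAsync(int id)
    {
        await Task.Delay(200);
        return $"pdf-{id}";
    }

    // Awaits each call before starting the next, like the await foreach version.
    public static async Task<string[]> SequentialAsync(int[] ids)
    {
        var results = new string[ids.Length];
        for (var i = 0; i < ids.Length; i++)
        {
            results[i] = await FakeGetImagePdfAsync(ids[i]);
        }
        return results;
    }

    // Starts every call first, then awaits them all together.
    public static async Task<string[]> ConcurrentAsync(int[] ids)
    {
        // ToList forces the lazy Select to run now, so every task is started here.
        var tasks = ids.Select(id => FakeGetImagePdfAsync(id)).ToList();

        // The result array is in the same order as the tasks.
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        var ids = Enumerable.Range(1, 10).ToArray();

        var sw = Stopwatch.StartNew();
        await SequentialAsync(ids);
        Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        var results = await ConcurrentAsync(ids);
        Console.WriteLine($"Concurrent: {sw.ElapsedMilliseconds} ms, {results.Length} results");
    }
}
```

With ten simulated items, the sequential version should take roughly ten times the single delay, while the concurrent version should finish in about the time of one delay, since the waits overlap.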