c# asynchronous parallel-processing tpl-dataflow

Link TransformBlock producing IEnumerable<T> to block that receives T


I'm writing a web-gallery scraper and I want to parallelize the file processing as much as possible with TPL Dataflow.

To scrape, I first get the gallery main page and parse the HTML to get the image page links as a list. Then I go to each page in the list and parse its HTML to get the link to the image, which I then want to save to disk.

Here's the outline of my program:

var galleryBlock = new TransformBlock<Uri, IEnumerable<Uri>>(async uri =>
{
    // 1. Get the page
    // 2. Parse the page to get the urls of each image page
    return imagePageLinks;
});

var imageBlock = new TransformBlock<Uri, Uri>(async uri =>
{
    // 1. Go to the url and fetch the image page html
    // 2. Parse the html to retrieve the image url
    return imageUri;
});

var downloadBlock = new ActionBlock<Uri>(async uri =>
{
    // Download the image from uri and save it to disk
});

var opts = new DataflowLinkOptions { PropagateCompletion = true};
// This doesn't work: galleryBlock returns a list (IEnumerable<Uri>), but imageBlock accepts a single Uri.
// However, I still want the items of that list to be processed in parallel.
galleryBlock.LinkTo(imageBlock, opts);
imageBlock.LinkTo(downloadBlock, opts);

Solution

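  • You can use a TransformManyBlock in place of your TransformBlock. Its delegate returns an IEnumerable<TOutput> (or a Task of one), and the block emits each element as a separate message, so imageBlock receives individual Uri items: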

    var galleryBlock = new TransformManyBlock<Uri, Uri>(async uri =>
    {
        return Enumerable.Empty<Uri>(); //just to get it compiling
    });
    
    var imageBlock = new TransformBlock<Uri, Uri>(async uri =>
    {
        return null;  //just to get it compiling
    });
    
    var opts = new DataflowLinkOptions { PropagateCompletion = true };
    galleryBlock.LinkTo(imageBlock, opts); // bingo!
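
For reference, here is a minimal end-to-end sketch of the whole pipeline built on the same idea. It is an illustration under assumptions, not the asker's actual scraper: the helpers GetImagePageLinksAsync and GetImageUriAsync are hypothetical stand-ins that fake the HTTP and HTML-parsing steps, the "download" just records the URI, and MaxDegreeOfParallelism = 4 is an arbitrary value. Depending on the target framework, System.Threading.Tasks.Dataflow may need to be added as a NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class GalleryPipelineSketch
{
    // Hypothetical stand-in for fetching and parsing a gallery page:
    // pretends every gallery links to three image pages.
    private static async Task<IEnumerable<Uri>> GetImagePageLinksAsync(Uri gallery)
    {
        await Task.Delay(10); // simulated network latency
        return Enumerable.Range(1, 3)
            .Select(i => new Uri(gallery, $"page{i}.html"))
            .ToList();
    }

    // Hypothetical stand-in for parsing an image page to find the image URL.
    private static async Task<Uri> GetImageUriAsync(Uri imagePage)
    {
        await Task.Delay(10); // simulated network latency
        return new Uri(imagePage.AbsoluteUri.Replace(".html", ".jpg"));
    }

    public static async Task<List<Uri>> RunAsync(IEnumerable<Uri> galleries)
    {
        // Assumed value: lets each block work on up to four messages at once.
        var parallel = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 };

        // One gallery Uri in, many image-page Uris out (one message per element).
        var galleryBlock = new TransformManyBlock<Uri, Uri>(
            uri => GetImagePageLinksAsync(uri), parallel);

        // One image-page Uri in, one image Uri out.
        var imageBlock = new TransformBlock<Uri, Uri>(
            uri => GetImageUriAsync(uri), parallel);

        // Stand-in for saving to disk: just record which URIs arrived.
        var downloaded = new ConcurrentBag<Uri>();
        var downloadBlock = new ActionBlock<Uri>(async uri =>
        {
            await Task.Delay(10); // simulated download
            downloaded.Add(uri);
        }, parallel);

        var opts = new DataflowLinkOptions { PropagateCompletion = true };
        galleryBlock.LinkTo(imageBlock, opts);
        imageBlock.LinkTo(downloadBlock, opts);

        foreach (var gallery in galleries)
            await galleryBlock.SendAsync(gallery);

        // Signal that no more input is coming, then wait for the last block to drain.
        galleryBlock.Complete();
        await downloadBlock.Completion;

        return downloaded.OrderBy(u => u.AbsoluteUri).ToList();
    }
}
```

Calling `await GalleryPipelineSketch.RunAsync(new[] { new Uri("http://example.com/a/") })` from an async method runs the three stages and returns the collected image URIs. Because PropagateCompletion is set on both links, completing galleryBlock is enough for completion to flow down to downloadBlock once all buffered items have been processed.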