c#, .net, task-parallel-library, tpl-dataflow, pipelining

TPL Dataflow BatchBlock Duplicate elements


My Dataflow pipeline starts with a BatchBlock, and several Tasks post items into it. The BatchBlock propagates data to the next block on a Timer, using the TriggerBatch() method.

You can assume that no batch ever reaches the (very large) batch size specified when the BatchBlock was created, i.e. each triggered batch may have a different size.

Just before triggering the BatchBlock I would like to remove all duplicate items present in the batch that is about to be propagated to the next block in the pipeline. Is there a way I can do that?


Solution

  • You can't add or remove items that are stored inside blocks.

    However, you can add a TransformBlock after the BatchBlock that removes the duplicates from each batch and passes the batch on. Keep in mind that this means the batches reaching the rest of the pipeline may be smaller than the ones the BatchBlock emitted.

    Assuming equality members are implemented correctly it can look like this:

    var transformBlock = new TransformBlock<int[], IEnumerable<int>>(batch => new HashSet<int>(batch));
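
To show where that TransformBlock sits in the pipeline, here is a minimal sketch written as a C# top-level program. It assumes int items and a greedy BatchBlock. The names dedupeBlock, sink and batchSizes are made up for illustration, and TriggerBatch() is called directly where the question's Timer callback would call it, so the run is repeatable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks.Dataflow;

// Very large batch size, so batches are only emitted by TriggerBatch() or on completion.
var batchBlock = new BatchBlock<int>(100_000);

// Collapses each batch into a set; duplicates are detected through the
// element type's Equals/GetHashCode.
var dedupeBlock = new TransformBlock<int[], IEnumerable<int>>(batch => new HashSet<int>(batch));

// Hypothetical final stage: records the size of every de-duplicated batch it receives.
var batchSizes = new List<int>();
var sink = new ActionBlock<IEnumerable<int>>(batch => batchSizes.Add(batch.Count()));

// Propagate completion so that awaiting the sink waits for the whole pipeline.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
batchBlock.LinkTo(dedupeBlock, linkOptions);
dedupeBlock.LinkTo(sink, linkOptions);

foreach (var item in new[] { 1, 2, 2, 3, 3, 3 }) batchBlock.Post(item);
batchBlock.TriggerBatch(); // in the real pipeline the Timer callback would make this call

foreach (var item in new[] { 4, 4 }) batchBlock.Post(item);
batchBlock.Complete();     // completing flushes the remaining items as a final batch

await sink.Completion;
Console.WriteLine(string.Join(", ", batchSizes));
```

The eight posted items contain four distinct values, so only four items in total reach the sink. For a custom element type whose equality members are not implemented, HashSet<T> also has a constructor overload that takes an IEqualityComparer<T>.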