Suppose you have a TransformBlock with configured parallelism and want to stream data trough the block. The input data should be created only when the pipeline can actually start processing it. (And should be released the moment it leaves the pipeline.)
Can I achieve this? And if so how?
Basically I want a data source that works as an iterator. Like so:
public IEnumerable<Guid> GetSourceData()
{
//In reality -> this should also be an async task -> but yield return does not work in combination with async/await ...
Func<ICollection<Guid>> GetNextBatch = () => Enumerable.Repeat(100).Select(x => Guid.NewGuid()).ToArray();
while (true)
{
var batch = GetNextBatch();
if (batch == null || !batch.Any()) break;
foreach (var guid in batch)
yield return guid;
}
}
This would result in +- 100 records in memory. OK: more if the blocks you append to this data source would keep them in memory for some time, but you have a chance to get only a subset (/stream) of data.
Some background information:
I intend to use this in combination with azure cosmos db, where the source could all objects in a collection, or a change feed. Needless to say that I don't want all of those objects stored in memory. So this can't work:
using System.Threading.Tasks.Dataflow;
public async Task ExampleTask()
{
Func<Guid, object> TheActualAction = text => text.ToString();
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 5,
MaxDegreeOfParallelism = 15
};
var throtteler = new TransformBlock<Guid, object>(TheActualAction, config);
var output = new BufferBlock<object>();
throtteler.LinkTo(output);
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
//...
throtteler.Complete();
await throtteler.Completion;
}
The above example is not good because I add all the items without knowing if they are actually being 'used' by the transform block. Also, I don't really care about the output buffer. I understand that I need to send it somewhere so I can await the completion, but I have no use for the buffer after that. So it should just forget about all it gets ...
Post()
will return false
if the target is full without blocking. While this could be used in a busy-wait loop, it's wasteful. SendAsync()
on the other hand will wait if the target is full :
public async Task ExampleTask()
{
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 50,
MaxDegreeOfParallelism = 15
};
var block= new ActionBlock<Guid, object>(TheActualAction, config);
while(//some condition//)
{
var data=await GetDataFromCosmosDB();
await block.SendAsync(data);
//Wait a bit if we want to use polling
await Task.Delay(...);
}
block.Complete();
await block.Completion;
}