I am trying to add some additional aggregate functions to the Seq module. I was looking at the implementation of some of the functions listed here:
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/seq.fs
One of the disclaimers there is "This function returns a sequence that digests the whole initial sequence as soon as that sequence is iterated. As a result this function should not be used with large or infinite sequences." This is true of many functions, such as the GroupBy.
First question: are there methods to write aggregate functions that can handle large sequences efficiently? I know "Large" is subjective; I'm just looking for general patterns to write such functions.
Second question: how do I ensure that the collections such as Dictionary(that are defined within the aggregate functions) are garbage collected efficiently? I understand Dictionaries should be collected when they go out of scope, but is there a way to indicate that explicitly? Given that the dictionary is scoped to remain within the function, I cant call .Clear() on that either right?
To answer your first question - In this case, the problem with large inputs is that the entire sequence has to be processed before functions like fold
or groupBy
can give result. There are a few things you can do:
Use functions like Seq.scan
that aggregate values just like fold
but yield current state after adding every element - the result is also sequence and you can laziliy consume it (and get more and more precise result, for example).
When writing functions that return seq<'a>
, you should design them so that getting the next element from the sequence only consumes some predictable number of elements of the input (but not the whole input sequence). This is not possible e.g. for groupBy
, but you can write grouping-like construct that groups only adjacent elements of the same group.
To answer the second question - you shouldn't generally worry about garbage collector too much. Forcing garbage collection at the end of function would probably cause more harm than just relying on the GC to work well.