Tags: performance, f#, ienumerable, seq

Performance of sequences with while vs. for-do comprehensions, compared to direct `IEnumerable<T>` implementation


(sorry for the long post, to skip directly to the question(s), see bottom)
(UPDATE: if you are revisiting, please see sections marked "update" ;)

I set out to understand better what goes on under the hood with F# sequences. A task I needed to optimize involved converting strings into a sequence of Unicode codepoints, and I was wondering whether I could replace the mutable loop we were using with an immutable one without sacrificing too much performance.

One of the challenges is that the returned sequence does not have the same length as the input sequence, because surrogate pairs together yield a single integer. This is what the original code looked like:

let stocp0(value: string) : seq<int> =
    let pos = ref 0
    seq {
        while !pos < value.Length do
            let c = System.Char.ConvertToUtf32(value, !pos)
            pos := !pos + if c >= 0x10000 then 2 else 1
            yield c
    }

Attempt 1: for-do

I figured that the easiest thing to do was to turn it into a for-do loop (not a for-in-do loop, which has too much extra overhead):

open System

let inline stocp1(value: string) =
    seq {
        for i = 0 to value.Length - 1 do
            if not (Char.IsLowSurrogate(value.[i]))
            then yield Char.ConvertToUtf32(value, i)
    }

This performed 3.2 times slower than the mutable counterpart above, a higher factor than I had imagined.

Attempt 2: Seq.mapi

Since a string is already a sequence (well, there's an IEnumerable<char> wrapper), I thought I would utilize that with existing functions from the Seq module, hoping that might bring better performance:

let inline stocp2(value: string) =
    value
    |> Seq.mapi (fun i c ->
        if Char.IsLowSurrogate(c) then 0
        else Char.ConvertToUtf32(value, i))
    |> Seq.filter ((<>) 0)

It performed 3.5 times slower.

Strangely, if I replace value with value.AsEnumerable(), it performs significantly faster, even edging out stocp1: factor 3.0 (listed as stocp2' in the tables below).

After several more tests it became clear to me that each |> creates a new IEnumerable<T> layer, with all the chained calls that involves (this can also be observed in the F# source code of the Seq module). The size of the overhead surprised me, though. Since none of the above came even remotely close to the performance of the original, I decided to try to keep the extra sequence overhead from kicking in and wrote a Seq.mapiAndFilter function to do both actions at once.
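
To illustrate the layering (a simplified sketch I am adding here, not the actual FSharp.Core implementation): every combinator wraps the enumerator of its input, so each element travels through one extra MoveNext/Current pair per pipeline stage:

// Simplified sketch of what each pipeline stage adds; the real Seq.filter
// in FSharp.Core is more involved, but the wrapping principle is the same.
let myFilter (predicate: 'a -> bool) (source: seq<'a>) : seq<'a> =
    seq {
        // this seq's enumerator wraps source's enumerator, so every element
        // now passes through two MoveNext calls instead of one
        for x in source do
            if predicate x then yield x
    }

// value |> Seq.mapi ... |> Seq.filter ... therefore stacks two such layers
// on top of the string's own enumerator.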

Attempt 3: Seq.mapiAndFilter

Since it is such a tight loop, and I only need to filter on the current character and yield based on the current position, perhaps I could remove the extra step involved with Seq.mapi, which seems expensive.

To do that, I needed to mimic the behavior of the existing Seq.xxx functions, and my first attempt was a while-yield loop. This is closest to the original mutable approach but adds one layer of IEnumerable<T> overhead.

I wrote the following function, which takes a predicate and a mapping function: whenever the predicate returns true, it applies the mapping to the position of the current element.

let inline mapiAndFilter f g (e: seq<_>) : seq<_> =
    let position = ref -1
    // note: the enumerator is obtained once, outside the seq expression
    let e = e.GetEnumerator()
    seq {
        while e.MoveNext() do
            position := !position + 1
            if f e.Current then yield g !position
    }


// and the application of it (AsEnumerable requires: open System.Linq):
let inline stocp3(value: string) =
    value.AsEnumerable()
        |> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32 (value, i))

The result was much better than the previous attempts: it clocked in at 1.5 times the cost of the mutable solution. That is still disappointingly slow, though; it seems to imply that the overhead added by the enumerators is about 50% in tight loops.

Attempt 4: improved Seq.mapiAndFilter

To find out what was going on under the hood, I then decided to write the enumerable type explicitly, which would let me see whether the boilerplate checks added in the FSharp libraries had something to do with the low performance characteristics.

Without the safeguards the F# Seq functions use internally (to raise an error on illegal usage of Current, etc.), I came up with this:

open System
open System.Collections.Generic

let mapiAndFilter f g (e : seq<'a>) : seq<'b> =
    let i = ref -1
    let e = e.GetEnumerator()
    // stands in for the undefined noReset() helper in the original snippet
    let noReset() = raise (NotSupportedException "Reset is not supported")
    let inline getEnum() =
        { new IEnumerator<'b> with
              member x.Current = g !i
          interface System.Collections.IEnumerator with
              member x.Current = box <| g !i
              member x.MoveNext() =
                  let rec next() =
                      i := !i + 1
                      e.MoveNext() && (f e.Current || next())
                  next()
              member x.Reset() = noReset()
          interface IDisposable with
              member x.Dispose() = e.Dispose() }
    { new IEnumerable<'b> with
          member x.GetEnumerator() = getEnum()
      interface System.Collections.IEnumerable with
          member x.GetEnumerator() = getEnum() :> System.Collections.IEnumerator }

// apply the same way as before:
let inline stocp4(value: string) =
    value.AsEnumerable()
        |> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32 (value, i))

This became our current winner! It clocked in at a factor of only 1.1 compared to the original mutable function. Of course, it uses mutable state, but so do all the Seq.xxx functions internally anyway.

Performance comparison overview

General note on all attempts above: I also tested with ToCharArray(), which improves performance on small-to-medium input but becomes detrimental on large input strings, especially when not all items need to be enumerated. Many other approaches I left out because their performance was so much worse (Seq.choose instead of Seq.filter is much slower, Seq.collect is very slow, etc.).
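
For reference, a hypothetical sketch of the ToCharArray() variant mentioned above (the name stocp2'' is mine, not one of the benchmarked functions):

// Materializing the chars up front: helps small-to-medium input, but the
// full copy hurts on very large strings and when bailing out early.
let stocp2''(value: string) =
    value.ToCharArray()
    |> Seq.mapi (fun i c ->
        if Char.IsLowSurrogate(c) then 0
        else Char.ConvertToUtf32(value, i))
    |> Seq.filter ((<>) 0)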

I used the following for the performance comparison (apparently, Seq.length is the fastest way to force iteration; Seq.last and Seq.iter are much slower):

let input = "ab\U0001ABCDcde\U0001ABCEfghi\U0001ABCF"
let run f = for i in 1 .. 1000000 do f input |> Seq.length |> ignore;;
run stocp1
// etc

Results:

Function  CPU     Factor
stocp0    0.296   1.0
stocp1    0.951   3.2
stocp2    1.029   3.5
stocp2'   0.873   3.0
stocp3    0.436   1.5
stocp4    0.327   1.1
stocp5    0.405   1.3 (latkin's answer, adj. with Array.toSeq)

stocp2' is the version that uses AsEnumerable() on the string prior to passing it to the Seq.xxx functions. The later functions (stocp3, stocp4) already use this.

I also tested with longer and with very large (50 MB) strings, which is our typical use case, and while the timings are less stable across runs, the effective factors are about the same as above.

Update: I added latkin's answer as stocp5, but had to adjust it by adding an Array.toSeq. Without it, it clocks in at 0.234, which is faster than the original while loop. Unfortunately, I require a sequence (we must use lazy loading and cannot hold whole strings in memory).
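
For clarity, the adjustment looks like this (eager1 is latkin's function, shown in his answer below):

// stocp5: latkin's eager, array-based answer, forced back into a seq so it
// fits the lazy-sequence requirement of the benchmark
let stocp5(value: string) = eager1 value |> Array.toSeq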

(Update) Performance comparison incl. element access

The above comparison only tests iteration, which helps isolate the issues caused by the stacked iterators. However, the timings change a little once you add element access to the equation. I forced it by adding a Seq.map id:

let runmap f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.length |> ignore;;

Results:

Function  CPU     Factor
stocp0    0.795   1.0
stocp1    1.528   1.9
stocp2    1.731   2.2
stocp2'   1.812   2.3
stocp3    0.936   1.2
stocp4    0.842   1.1
stocp5    0.873   1.1  (credit: latkin, see his answer and notes above)

(Update) Performance comparison incl. limited element access

Since our typical use cases do not require full iteration, I added a test that iterates only until the second surrogate pair at position 6, with much larger input (3,932,160 characters).

let runmapnth f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.nth 6 |> ignore;;

Results:

Function  CPU     Factor
stocp0    0.624   1.0
stocp1    1.029   1.6
stocp2    1.263   2.0
stocp2'   1.107   1.8
stocp3    0.717   1.1
stocp4    0.624   1.0
stocp5    ---     --- OOM

The OutOfMemoryException with latkin's answer surprised me a bit: it means the created arrays were not cleaned up when used in a tight loop like the above. My machine's memory allocation spiked to 8 GB several times within a few seconds, dropping in between (GC?), but in the end it still failed. Strange:

(screenshot: 8 GB allocation spikes)

The other performance characteristics are as can be expected based on earlier observations.

Conclusion, questions

With the last exercise above, I found out something I hadn't expected: the F# compiler only calls the non-generic IEnumerator.Current and never calls IEnumerator<T>.Current. This may explain in part why the performance degradation with chained sequence filters is so noticeable when the element is a value type: the boxing moves it onto the heap and back again, which is terrible.
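
To make the boxing cost concrete, here is an illustration I added (not part of the benchmark code) of what going through the non-generic interface does to a value type:

// Iterating through the non-generic interface: Current returns obj, so
// every int is boxed onto the heap and unboxed again.
let sumViaNonGeneric (xs: seq<int>) =
    let e = (xs :> System.Collections.IEnumerable).GetEnumerator()
    let mutable sum = 0
    while e.MoveNext() do
        sum <- sum + unbox<int> e.Current  // one heap allocation per element
    sum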

  • Why is the compiler not using the generic interface?

  • How come the for-loop is so slow; what happens internally here? Isn't it supposed to turn into a tail call, which is then compiled into a fast loop?

  • Is there a more natural or different way of writing a filter like I did (mapi, then filter) that does not have the detrimental performance drawbacks I described?

  • Why is there such a big difference between piping the string directly (slow) and string.AsEnumerable() (fast)?

I have many more questions, but the SO format generally wants you to ask a single simple question, which I clearly didn't. Sorry to have been so elaborate; I hope it doesn't put too many people off from coming up with some insightful observations.

UPDATE: as pointed out in the comments, the boxing seems to occur only when run from F# Interactive (FSI). If you take stocp4 and change the calling code by adding a redundant Seq.filter ((<>) 0) (or something similar), it will instead call the unboxed accessor. Why? No idea.
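
For reference, the adjusted calling code described above looks like this (the name stocp4' is mine):

// the redundant extra stage that, under FSI, makes the generic (unboxed)
// Current accessor be called instead of the boxing one
let stocp4'(value: string) =
    stocp4 value |> Seq.filter ((<>) 0)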


Solution

  • Ok I'll take a shot. All code and benchmark results can be found here.

    Lazy v Eager: Seqs are slow. Comprehensions are slow. They are a convenient abstraction that involves a lot of compiler-generated goop and allocations, and should generally just be avoided altogether if perf is important. All of the implementations in question are handily beaten by the simple non-lazy solution below.

    // ~50% faster for given test case
    // still ~20% faster even for length 1.5M string
    let eager1 (value: string) =
        let result = ResizeArray(value.Length)
        for i in 0 .. value.Length - 1 do
            if not (Char.IsLowSurrogate(value.[i]))
            then result.Add(Char.ConvertToUtf32(value, i))
        result.ToArray()
    

    Generic v Non: Your generic code is being invoked in the benchmark function.

    Add a logging statement to both .Current implementations, pipe your output sequence to |> Seq.iter (printfn "%d"), and you will see it's the generic one that's invoked.

    Were you testing in FSI? For whatever reason, FSI's "print a few elements of this sequence to the console" code does wind up in the non-generic path, but that doesn't affect the executing code. Maybe that's what you were seeing?

    Loops in seq { }: Loops inside seq { } and other computation expressions are not regular loops. (In fact, pretty much nothing "normal looking" inside computation expressions is actually normal, which is kind of the point. :)) As indicated in the computation expression docs, a for loop ends up being codegenned as an iteration over another enumerable. while loops are a bit simpler.

    This more or less explains why your "attempt 1" is so much slower: the for loop results in allocation and iteration of yet another seq inside of your seq.
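
    Roughly, per the computation expression translation rules, the difference looks like this (a simplified sketch, not the exact compiler output):

    // seq { for i = 0 to n do yield f i } is translated (roughly) into a
    // builder call that constructs and iterates yet another enumerable:
    let forDesugared n f =
        { 0 .. n } |> Seq.collect (fun i -> Seq.singleton (f i))

    // seq { while cond() do yield v() } stays a single generator loop,
    // which is why the while-based stocp0 carries less overhead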

    Piping through Seq APIs: Yes, this will create new seqs at each step. If the "real work" involved is very tiny, as in this example, then overhead starts to dominate.

    Getting faster: Your subsequent optimizations each remove layers of abstraction, so although I don't have precise explanations, it seems reasonable that they get a bit faster.

    .AsEnumerable(): That's pretty wacky; I can repro the significant speedup that you see. Very strange, given that the AsEnumerable extension method does nothing but return its input directly!

    The structure of the generated code in these cases is very different. Maybe this is a pathological case in the optimizer somehow. Interesting find.

    Variations: I found that results vary quite significantly when you enable/disable optimizations, and when you target x64 vs x86. Take that for what it's worth.


    Update after the OP's changed benchmarks and requirements

    Array.toSeq: It is unnecessary to use Array.toSeq here, and it will predictably drag down the performance of my suggested solution. Array.toSeq and Seq.ofArray are there more for safety (making sure the resulting seq can't be converted back to an array by the consumer and mutated) than for type conversion.

    Better choices (a minimal sketch follows the list):

    • Simply cast the array to seq<_> when returning it
    • Update your other APIs to accept the flexible type #seq<'t>, then even a plain array is ok
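
    For illustration, a minimal sketch of both options:

    // option 1: a plain upcast, no copy involved
    let asSeq (xs: int[]) : seq<int> = xs :> seq<int>

    // option 2: a consumer declared with the flexible type accepts arrays,
    // lists, and seqs alike, without any conversion at the call site
    let consume (items: #seq<int>) = Seq.length items

    let n = consume [| 1; 2; 3 |]  // a plain array passed directly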

    Updated Requirements: Given the newly revealed constraints:

    • Processing strings so large that even 1 or 2 copies will cause OOM
    • Frequent early bail-out when processing

    then it's clear a lazy approach is going to be more appropriate, at least in some cases.

    Yet even with those requirements, in my testing with your new benchmarks, non-lazy solutions still perform very well in all cases, except when they hit OOM or face huge input with early bail-out.

    See my gist linked above for results. It includes alternative non-lazy implementations:

    let eager2 (value: string) =
        let result = ResizeArray(value.Length)
        for i in 0 .. value.Length - 1 do
            if not (Char.IsLowSurrogate(value.[i]))
            then result.Add(Char.ConvertToUtf32(value, i))
        // cast result so that return type isn't array
        (result.ToArray()) :> seq<_>
    
    let eager3 (value: string) =
        let result = ResizeArray(value.Length)
        for i in 0 .. value.Length - 1 do
            if not (Char.IsLowSurrogate(value.[i]))
            then result.Add(Char.ConvertToUtf32(value, i))
        // ToArray() causes another copy to be generated.
        // Avoiding that is a win in large-input scenarios, but at a cost
        // of otherwise slower processing
        (result) :> seq<_>
    

    Improving lazy solution

    Here is a further optimization of the lazy approach, directly integrating all logic, avoiding use of the string enumerator, and avoiding recursion.

    This guy actually seems to beat the non-lazy solutions in most cases!

    open System
    open System.Collections
    open System.Collections.Generic

    let lazy5 (value : string) =
        let inline getEnum() = 
            let i = ref -1
            { new IEnumerator<int> with
                  member __.Current = Char.ConvertToUtf32(value, !i)
              interface System.Collections.IEnumerator with
                  member __.Current =  box (Char.ConvertToUtf32(value, !i))
                  member __.MoveNext() = 
                          incr i
                          if !i >= value.Length then false else
                          if not (Char.IsLowSurrogate(value.[!i])) then true else
                          incr i
                          !i < value.Length                  
                  member __.Reset() = failwith "reset"
              interface IDisposable with
                  member __.Dispose() = () }
        { new IEnumerable<int> with
              member __.GetEnumerator() = getEnum()
          interface IEnumerable with
              member __.GetEnumerator() = getEnum() :> IEnumerator }
    

    Summary

    The first while-based seq solution looks great and performs well given the constraints. I've tried to give some context on why the proposed alternatives might be slower; hopefully that's helpful. I managed to squeeze out a bit more perf by integrating everything into an explicit IEnumerable directly.

    Depending on constraints and input, a non-lazy solution might be a good choice. I've suggested a few options here. As always, you will need to test in your real environment.