Why do we not get a performance improvement with PSeq?

Question

We figured that processing each "tail permutation" in parallel would decrease the real running time of this algorithm. Why does it not? How if at all can we configure PSeq to improve the running time?

let rec permute (xs:'t list) = 
    if xs.Length = 1 then [xs]
    else ([], permute xs.Tail)
        // Why do we not get a performance 
        // improvement when we use PSeq here
        // relative to using Seq?
        ||> PSeq.fold (fun acc ps -> 
            (insertAtEach xs.Head ps)@acc)

permute ["A";"B";"C"] |> ignore

We figured that the permutations work would be split as follows, and that the insertAtEach phase of the algorithm would work in parallel.

                        [C]

             [BC]                [CB]

[ABC] [BCA] [BCA]                [ACB] [CAB] [CBA]

            CPU01                CPU02

It is actually slower in parallel even when we use a large initial list such as permute [0..9]. We have not not figured out whether withDegreeOfParallelism and related PSeq options will help.

Aside

Here is the rest of the code listing:

let put a q xs =
    ([], xs)
        ||> Seq.fold (fun acc x ->
            if x = q then a::x::acc
            else x::acc)
        |> List.rev

// Insert x at each possible location in xs
let insertAtEach a xs =
    ([a::xs], xs)
        ||> Seq.fold (fun acc x ->
            (put a x xs)::acc)

Solution

PSeq.fold doesn't do what you think it does. PSeq.fold is actually not parallel at all. It's sequential.

You can't just throw the word "parallel" somewhere in the middle of your algorithm and hope for the best. That's not how parallelization works. You have to actually understand what's going on, what happens in parallel, what is sequential, what can be parallelized in principle, and what cannot.

Take the fold for example: it applies the provided folding function to every element of the sequence and the result of the previous call. Since every next call must have the result of the previous call before it can execute, it's pretty obvious that fold cannot be carried out in parallel at all. Has to be sequential. And in fact, that's what PSeq.fold actually does if you look at its source code. So you get some overhead for converting to a ParallelSequence every time, but no gain.

Now, if you look at your algorithm carefully, you can tease out the parallelizable part. What your algorithm does:

Recursively compute permutations of the tail.
For every one of those permutations, explode it by inserting the head at every index.
Concatenate all results of step 2.

When you put it like this, it's easy to see that step 2 doesn't really depend on anything but itself. Every tail permutation is processed completely separately from all others.

Of course, this isn't evident just from looking at your source code, because you've combined steps 2 and 3 in one expression (insertAtEach xs.Head ps)@acc, but it's easy to tease apart using the following general identity:

xs |> fold (fun a x -> g a (f x))
    ===
xs |> map f |> fold (fun a x -> g a x)

That is, instead of applying function f to every element x during the fold, you can apply it to every element "in advance" using map.

Applying this idea to your algorithm, you get:

let rec permute (xs:'t list) = 
    if xs.Length = 1 then [xs]
    else permute xs.Tail 
         |> Seq.map (insertAtEach xs.Head)
         |> Seq.fold (fun acc ps -> ps@acc) []

Now it's easy to see that the Seq.map step is the parallelizable one: map applies the function to each element independently of other elements, so it can in principle work in parallel. Just replace Seq with PSeq and you got yourself the parallel version:

let rec permute (xs:'t list) = 
    if xs.Length = 1 then [xs]
    else permute xs.Tail 
         |> PSeq.map (insertAtEach xs.Head)
         |> PSeq.fold (fun acc ps -> ps@acc) []

Whether or not PSeq actually does perform map in parallel is another matter, but that should be easily verifiable empirically.

And indeed, on my machine, the parallel version consistently outperforms the sequential version for lists of 7 to 10 elements (11-element list caused an OOM).

P.S. keep in mind that, when measuring time, you need to force the resulting sequence first (e.g. by converting it to a list or taking Seq.last). Otherwise all you're measuring is just the parallelization overhead cost.

P.P.S. here's a gist with my benchmark code.