concurrency go parallel-processing quicksort goroutine

Worse performance when using go routines in parallel quicksort implementation

Note: The "Go-lang parallel segment runs slower than series segment" question dealt with race conditions, this one has another issue, so imho it's not a duplicate.

I'm trying to find an explanation for the following situation: Running parallel quicksort results in a significantly longer runtime when done using go routines.

Benchmarks are after the code:

package c9sort

import (
    "time"
)

var runInParllel bool

func Quicksort(nums []int, parallel bool) ([]int, int) {
    started := time.Now()
    ch := make(chan int)
    runInParllel = parallel

    go quicksort(nums, ch)

    sorted := make([]int, len(nums))
    i := 0
    for next := range ch {
        sorted[i] = next
        i++
    }
    return sorted, int(time.Since(started).Nanoseconds() / 1000000)
}

func quicksort(nums []int, ch chan int) {

    // Choose first number as pivot
    pivot := nums[0]

    // Prepare secondary slices
    smallerThanPivot := make([]int, 0)
    largerThanPivot := make([]int, 0)

    // Slice except pivot
    nums = nums[1:]

    // Go over slice and sort
    for _, i := range nums {
        switch {
        case i <= pivot:
            smallerThanPivot = append(smallerThanPivot, i)
        case i > pivot:
            largerThanPivot = append(largerThanPivot, i)
        }
    }

    var ch1 chan int
    var ch2 chan int

    // Now do the same for the two slices
    if len(smallerThanPivot) > 1 {
        ch1 = make(chan int, len(smallerThanPivot))
        if runInParllel {
            go quicksort(smallerThanPivot, ch1)
        } else {
            quicksort(smallerThanPivot, ch1)
        }
    }
    if len(largerThanPivot) > 1 {
        ch2 = make(chan int, len(largerThanPivot))
        if runInParllel {
            go quicksort(largerThanPivot, ch2)
        } else {
            quicksort(largerThanPivot, ch2)
        }
    }

    // Wait until the sorting finishes for the smaller slice
    if len(smallerThanPivot) > 1 {
        for i := range ch1 {
            ch <- i
        }
    } else if len(smallerThanPivot) == 1 {
        ch <- smallerThanPivot[0]
    }
    ch <- pivot

    if len(largerThanPivot) > 1 {
        for i := range ch2 {
            ch <- i
        }
    } else if len(largerThanPivot) == 1 {
        ch <- largerThanPivot[0]
    }

    close(ch)
}

Benchmarks for a random perm of 500000 integers:

Ran 100 times

Non parallel average - 1866ms

Parallel average - 2437ms

Any explanation would be appreciated. I know goroutines may not be best for this kind of parallelism, but I'm trying to understand the reason.

Thank you in advance.

Solution

Turns out it was very simple. As I'm on a new machine, the GOMAXPROCS variable wasn't set.

The new benchmark favors, as predicted, the parallel implementation: Set to double the number of cores:

Using 16 goroutines
Ran 100 times

Non parallel average - 1980
Parallel average - 1133

Set to the number of cores:

Using 8 goroutines
Ran 100 times

Non parallel average - 2004
Parallel average - 1197

By the way, this is fairly consistent. The average for double the number of cores is always a bit better.

Benchmark for a larger collection (1000000):

Using 8 goroutines
Ran 100 times

Non parallel average - 3748
Parallel average - 2265

With double:

Using 16 goroutines
Ran 100 times

Non parallel average - 3817
Parallel average - 2012