Tags: arrays, swift, concurrency, swift-concurrency

Concurrently access non-overlapping subranges of Array in Swift


Swift 5.9

I'm trying to concurrently sort some number of non-overlapping subranges of an array. My approach thus far looks something like this:

// Not the real array setup, but sufficiently representative
let arraySize = 1_000_000
var someArray = [Int]()
someArray.reserveCapacity(arraySize)
for _ in 0..<arraySize {
    someArray.append(.random(in: 0...1_000_000))
}

// Let's say we want to break the array into 6 SubSequences of
// roughly equal size and sort each one
let divisions = 6
var divisionLength = arraySize / divisions
// Round the division length up so that the divisions
// cover the entire array
if arraySize % divisions != 0 { divisionLength += 1 }

await withTaskGroup(of: Void.self) { taskGroup in
    var currentIndex = 0

    while currentIndex < someArray.endIndex {
        let sliceStart = currentIndex
        let sliceEnd = min(currentIndex + divisionLength, someArray.endIndex)
        currentIndex = sliceEnd

        taskGroup.addTask {
            // compilation error: mutation of captured var 'someArray'
            // in concurrently-executing code

            someArray[sliceStart..<sliceEnd].sort()
        }
    }

    await taskGroup.waitForAll()
}

I have a sneaking suspicion that, due to the way Swift handles mutating value types, there doesn't exist a mechanism to say "Trust Me Bro™️" with regards to concurrent mutability.

This is based on my understanding that even if I abstract the sort function and mutex the ranges on which it operates, CoW would invalidate the references held by the other threads.

Still, I would like to know for certain - does there exist some kind of concurrency safety override that would allow me to do this? Or, is there another approach entirely?


Solution

  • There are two basic approaches:

    1. You can make the original array immutable, thereby eliminating the unsafe concurrent access, and have each task return a new sorted array for its slice. Because these tasks can complete in any order, have the task group return a tuple of the “division” index along with its sorted subarray, so the results can be collated at the end:

      let arraySize = 60
      let array: [Int] = (0 ..< arraySize).map { _ in .random(in: 1000...9999) }
      
      var divisions = 6
      // Round the division length up so the divisions cover the whole array…
      var (divisionLength, remainder) = arraySize.quotientAndRemainder(dividingBy: divisions)
      if remainder != 0 { divisionLength += 1 }
      // …then recompute how many divisions of that length are actually needed
      (divisions, remainder) = arraySize.quotientAndRemainder(dividingBy: divisionLength)
      if remainder != 0 { divisions += 1 }
      
      let result = await withTaskGroup(of: (Int, [Int]).self) { group in
          for division in 0 ..< divisions {
              let start = division * divisionLength
              let end = min((division + 1) * divisionLength, array.endIndex)
      
              group.addTask {
                  (division, array[start..<end].sorted())
              }
          }
      
          let dictionary: [Int: [Int]] = await group.reduce(into: [:]) { $0[$1.0] = $1.1 }
          return (0 ..< divisions).flatMap { dictionary[$0]! }
      }
      

      This obviously has more overhead (both in space and time). But it illustrates a common thread-safety pattern, namely immutable value types.
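To make the division arithmetic concrete, here is a worked example (the numbers are illustrative, not from the answer): with 100 elements and a requested 6 divisions, rounding the division length up changes the final layout.

```swift
let size = 100       // illustrative element count
var divisions = 6    // requested number of divisions

// Round the division length up so the divisions cover the whole array:
// 100 / 6 = 16 remainder 4, so divisionLength becomes 17.
var (divisionLength, remainder) = size.quotientAndRemainder(dividingBy: divisions)
if remainder != 0 { divisionLength += 1 }

// Then recompute how many divisions of that length are actually needed:
// 100 / 17 = 5 remainder 15, so divisions is again 6.
(divisions, remainder) = size.quotientAndRemainder(dividingBy: divisionLength)
if remainder != 0 { divisions += 1 }

print(divisionLength, divisions)  // 17 6 — five slices of 17, one final slice of 15
```

The recomputation matters: without it, a size of 100 with six divisions of length 17 would otherwise loop over indices past the end of the array.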

    2. You can abandon Array and instead use an unsafe pointer/buffer, e.g., UnsafeMutableBufferPointer:

      let size = 60
      let pointer = UnsafeMutableBufferPointer<Int>.allocate(capacity: size)
      pointer.initialize(from: (0 ..< size).map { _ in .random(in: 1000 ... 9999) })
      
      defer {
          pointer.deinitialize()
          pointer.deallocate()
      }
      
      var divisions = 6
      var (divisionLength, remainder) = size.quotientAndRemainder(dividingBy: divisions)
      if remainder != 0 { divisionLength += 1 }
      (divisions, remainder) = size.quotientAndRemainder(dividingBy: divisionLength)
      if remainder != 0 { divisions += 1 }
      
      await withTaskGroup(of: Void.self) { group in
          for division in 0 ..< divisions {
              let start = division * divisionLength
              let end = min((division + 1) * divisionLength, size)
      
              group.addTask {
                  pointer[start..<end].sort()
              }
          }
      }
      

      But note that when using unsafe pointers, (a) you are vouching for the “unsafe” access to this buffer; and (b) you bear responsibility for manually deinitializing and deallocating the pointer when you are done with it.
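If your data already lives in an Array, a middle ground (a sketch, not part of the answer above) is to borrow the array's own storage via withUnsafeMutableBufferPointer. Note that its closure is synchronous, so it pairs with DispatchQueue.concurrentPerform rather than a task group:

```swift
import Dispatch

var array: [Int] = (0 ..< 1_000).map { _ in .random(in: 1000 ... 9999) }

let divisions = 6
let divisionLength = (array.count + divisions - 1) / divisions  // round up

array.withUnsafeMutableBufferPointer { buffer in
    // Capture the base address and count as immutable values; each
    // iteration then constructs a buffer over its own non-overlapping
    // slice, so no two iterations touch the same elements.
    let base = buffer.baseAddress!
    let count = buffer.count
    DispatchQueue.concurrentPerform(iterations: divisions) { division in
        let start = division * divisionLength
        let end = min(start + divisionLength, count)
        guard start < end else { return }
        UnsafeMutableBufferPointer(start: base + start, count: end - start).sort()
    }
}
```

Within the closure you must not touch `array` itself (that could trigger CoW and invalidate the buffer), and you remain responsible for keeping the slices non-overlapping; the compiler does not check either of those for you.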


    For giggles and grins, I benchmarked the Array and UnsafeMutableBufferPointer approaches using a divisions count of 20 (on an M1 Ultra); the table below shows how many seconds each sort took. I also included a benchmark of a serial implementation, for the sake of comparison:

| Array size  | Parallel array | Parallel raw buffer | Serial    |
|------------:|---------------:|--------------------:|----------:|
| 64          | 0.000069       | 0.000051            | 0.000009  |
| 128         | 0.000095       | 0.000073            | 0.000022  |
| 256         | 0.000076       | 0.000050            | 0.000025  |
| 512         | 0.000068       | 0.000060            | 0.000018  |
| 1,024       | 0.000089       | 0.000059            | 0.000033  |
| 2,048       | 0.000156       | 0.000102            | 0.000124  |
| 4,096       | 0.000195       | 0.000146            | 0.000197  |
| 8,192       | 0.000220       | 0.000212            | 0.000456  |
| 16,384      | 0.000438       | 0.000233            | 0.000861  |
| 32,768      | 0.000573       | 0.000305            | 0.001704  |
| 65,536      | 0.000998       | 0.000579            | 0.003271  |
| 131,072     | 0.001121       | 0.000754            | 0.005821  |
| 262,144     | 0.002095       | 0.001288            | 0.011233  |
| 524,288     | 0.003953       | 0.002360            | 0.024545  |
| 1,048,576   | 0.008017       | 0.004527            | 0.052295  |
| 2,097,152   | 0.017985      | 0.009507            | 0.112165  |
| 4,194,304   | 0.031620       | 0.018867            | 0.242490  |
| 8,388,608   | 0.069403       | 0.039062            | 0.516160  |
| 16,777,216  | 0.116921       | 0.071195            | 1.094520  |
| 33,554,432  | 0.239149       | 0.154589            | 2.308661  |
| 67,108,864  | 0.507110       | 0.311157            | 4.903246  |
| 134,217,728 | 1.077651       | 0.626984            | 10.333774 |
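For reference, a crude harness along the lines below (a sketch using DispatchTime, not the author's actual benchmark code) can reproduce this style of measurement:

```swift
import Dispatch

// Wall-clock timing of a closure, in seconds.
func seconds(_ work: () -> Void) -> Double {
    let start = DispatchTime.now()
    work()
    return Double(DispatchTime.now().uptimeNanoseconds - start.uptimeNanoseconds) / 1e9
}

let size = 1 << 18
let input: [Int] = (0 ..< size).map { _ in .random(in: 1000 ... 9999) }
let divisions = 20
let divisionLength = (size + divisions - 1) / divisions

// Serial baseline: one sort over the whole array.
var serialCopy = input
let serialTime = seconds { serialCopy.sort() }

// Parallel: sort non-overlapping slices of the raw buffer concurrently.
var parallelCopy = input
let parallelTime = seconds {
    parallelCopy.withUnsafeMutableBufferPointer { buffer in
        let base = buffer.baseAddress!
        DispatchQueue.concurrentPerform(iterations: divisions) { division in
            let start = division * divisionLength
            let end = min(start + divisionLength, size)
            guard start < end else { return }
            UnsafeMutableBufferPointer(start: base + start, count: end - start).sort()
        }
    }
}

print("serial: \(serialTime)s, parallel: \(parallelTime)s")
```

Absolute numbers will differ from the table, of course; the point is only the relative trend as the array size grows.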

    Or, below is a similar analysis using Swift Collections Benchmark. But, whereas the above table was isolating the performance of the sorting only, the chart below includes the preparation of the array of random values, making the overall performance difference between the serial and parallel algorithms far less stark. But it still conveys the same basic story:

    [Chart: overall time, including preparation of the random array, for serial vs. parallel sorts across array sizes]

    The above was generated by this repo.

    Obviously, these results will vary based upon the hardware, the complexity of the sort, etc., but I draw a few conclusions from the above two sets of benchmark results:

    • On my hardware, parallelism made no difference until the array had roughly 8k items (with fewer items than that, the serial rendition was actually often faster).

    • The arrays needed roughly a million items or more before the difference became observable to a user.

    • With fewer than a million items, there is simply not enough work on each thread to materially offset the modest overhead that parallelism introduces.

    More broadly, we can conclude:

    • The array approach, which looks like it should be far less efficient, actually isn't that bad (and enjoys far simpler memory semantics).

    • The raw buffer pointer technique can be a little faster for the actual sort. (Note, my benchmarks above do not include the possible conversion of an array to an unsafe buffer, so that might offset any benefits we see if you are starting with an array.)

      I personally only use this buffer pointer technique if already in the world of unsafe buffers (e.g., pixel buffers), at which point this is very performant.

    • One needs enough work on each thread to enjoy the benefits of parallelism; simple sorts only enjoy significant benefits if the arrays are huge (or the sort comparison logic is more complicated).