Tags: swift, multithreading, for-loop, grand-central-dispatch

Is it possible to make this multithreaded code more efficient? (double for loop, O(n²))


I am using the minimum edit distance algorithm to find similar strings.

My goal is to find the couple with the closest hobbies among the input data.

Hobby type
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

// * Every person has 10 hobbies.
// * Hobbies cannot be duplicated.

// Input Data format
// 1000.txt
1000                   // number of people
N D R P Y X V B M T    // each line is one person's hobbies
P J Z E X W H S M N
F U G L C T Q S O P
A D Q S F E N P Y G
K P D F G Q U I V H
D E I S N L C K H G
D I R K J V Q H M Z
Y A R D F T N P W O
R A G Z F M J K I Y
P E K W H S F Z G R
U J O T I B R Y E H
M A Z N S J H X T P
...
...

// This is Couple Model
struct Couple {
  let firstIdx: Int
  let firstArray: String

  let secondIdx: Int
  let secondArray: String

  let value: Int
}

// This function finds the couple with the closest hobbies among the input data.
func findCouple() {
    guard
      // Utility.makeData's return value is the normalized input data, e.g. N D R P Y X V B M T -> BDMNPRTVYX
      let data = Utility.makeData(using: "1000"),
      let count = Int(data[0]) else { return }

    var couples = [Couple]()
    var min = data[1].count

    // make GCD Group for handle each queue.
    let group = DispatchGroup()  

    for i in 1..<count {
      // Pivot for hobby compare
      let hobby = data[i]

      // make custom queue for multi-threading
      let queue = DispatchQueue(label: "idx.\(i).queue", attributes: .concurrent)

      queue.async(group: group) {
        for j in (i + 1)..<data.count {
          // This is the subject of comparison.
          let hobby2 = data[j]

          // Calculate the edit distance to find similar strings
          let newMin = hobby.minimumEditDistance(other: hobby2)

          queue.async(group: group, qos: .userInitiated, flags: .barrier) {
            // If we found an equally or more similar pair
            if min >= newMin {
              min = newMin
              // Store the couple
              couples.append(
                Couple(firstIdx: i, firstArray: hobby, secondIdx: j, secondArray: hobby2, value: min)
              )
            }
          }
        }
      }
    }

    group.notify(queue: DispatchQueue.global()) {
      let similarCouples = couples.filter({ $0.value == min })
      // I want to print like
      // 1-3 : indexes of the two persons
      // ASDFZXCVNM : person 1's hobbies
      // ASDFXCVBJK : person 2's hobbies
    }
}

If the input data is large enough (10,000 entries or more), the performance of my function is terrible (very slow).

Please let me know if there are any improvements I can make.


Solution

  • You can undoubtedly improve greatly upon your initial attempt. Frankly, as surprising as it might seem, there’s a serious risk that your current rendition might even be less efficient than the non-parallelized version. Benchmark both and compare.

    Regarding the issues, they are manifold:

    1. Having more async calls is not always better. Unbridled async calls are going to introduce all sorts of inefficiencies. First, the number of worker threads is quite limited (currently 64). Second, there’s no utility in exceeding the number of available cores on your device, anyway.

      Rather than these async calls, we often reach for concurrentPerform. This is often the most efficient way to parallelize a for loop. This will never exceed the number of available concurrent threads permitted by your hardware.

      So, consider the non-parallelized rendition:

      for i in 0 ..< 9_999 {
          for j in i+1 ..< 10_000 {
              ...
          }
      }
      

      We'd parallelize it with simply:

      DispatchQueue.concurrentPerform(iterations: 9_999) { i in
          for j in i+1 ..< 10_000 {
              ...
          }
      }
      

      And, because concurrentPerform blocks the thread from which you call it, you’d likely dispatch the whole thing to a background queue:

      DispatchQueue.global().async {
          DispatchQueue.concurrentPerform(iterations: 9_999) { i in
              for j in i+1 ..< 10_000 {
                  ...
              }
          }
      
          DispatchQueue.main.async {
              // now update UI with the results of the above
          }
      }
      

      Note, I’m not using any async calls inside the concurrentPerform, as it does all the parallelization of the outer for loop for us. Anyway, with this pattern, I’m effectively doing 10,000 async calls (not 50m of them). And concurrentPerform ensures that I don’t have a thread explosion while I do that.

    2. Frankly, though, even 10,000 iterations is far too many. There’s not enough work per thread, and the per-thread overhead will add up. In fact, the benefits of parallelizing the routine will be adversely offset by all of this overhead.

      The typical solution is to “stride” through the calculations. For example, with 10,000 iterations, one might stride through, doing 200 iterations, for 50 values of i each, or something like that. For example:

      let stride = 50
      DispatchQueue.concurrentPerform(iterations: 200) { iteration in
          let startIndex = iteration * stride
          let endIndex = min(9_999, startIndex + stride)
          for i in startIndex ..< endIndex {
              for j in i+1 ..< strings.count {
                  ...
              }
          }
      }
      

      So, having previously dropped from 50m async calls to 10k iterations of concurrentPerform, we’re now down to only 200 iterations. That’s more than enough to achieve great parallelized performance gains, but with negligible overhead.

    3. Your code is not thread-safe. You are updating min and couples from multiple threads. Yes, you’re using a barrier, but each loop has its own queue, and the barrier is only for that current queue, not across queues. You have a data race. You must synchronize your access to these variables.

      Because there is overhead associated with synchronization, I might suggest that you go a step further and figure out how to minimize the synchronization needed. For example, each dispatched block might calculate its own local value of min distance, and then at the end of the block, only then should you check the local min distance against the master min distance.
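      Combining the stride pattern with this per-block local minimum, a sketch might look like the following. (The names `findClosestPairs` and `distance` are mine, not the asker's; `distance` stands in for any pairwise metric such as `minimumEditDistance`, and an `NSLock` is one of several valid ways to synchronize the merge.)

      ```swift
      import Foundation

      // Hedged sketch: each concurrentPerform block accumulates its own local
      // minimum and candidate pairs with no locking, then merges into the
      // shared state exactly once, under a lock.
      func findClosestPairs(in strings: [String],
                            distance: (String, String) -> Int) -> (min: Int, pairs: [(Int, Int)]) {
          let lock = NSLock()
          var globalMin = Int.max
          var globalPairs: [(Int, Int)] = []

          let stride = 50
          let iterations = (strings.count + stride - 1) / stride

          DispatchQueue.concurrentPerform(iterations: iterations) { iteration in
              let startIndex = iteration * stride
              let endIndex = Swift.min(strings.count - 1, startIndex + stride)

              // Accumulate results locally; no synchronization needed here.
              var localMin = Int.max
              var localPairs: [(Int, Int)] = []
              for i in startIndex ..< endIndex {
                  for j in (i + 1) ..< strings.count {
                      let d = distance(strings[i], strings[j])
                      if d < localMin {
                          localMin = d
                          localPairs = [(i, j)]
                      } else if d == localMin {
                          localPairs.append((i, j))
                      }
                  }
              }

              // Merge into the shared state under the lock, once per block.
              lock.lock()
              if localMin < globalMin {
                  globalMin = localMin
                  globalPairs = localPairs
              } else if localMin == globalMin {
                  globalPairs.append(contentsOf: localPairs)
              }
              lock.unlock()
          }

          return (globalMin, globalPairs)
      }
      ```

      This keeps the lock out of the hot inner loop entirely: with 200 blocks, you pay for 200 lock/unlock pairs instead of millions.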

    Those are a few tips to help you maximize your parallel calculation performance. On my MacBook Pro, the above yielded a calculation that was 9.9× faster than the non-parallelized rendition.
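    One last note: minimumEditDistance isn’t shown in the question, so here is a plausible stand-in, a standard Wagner–Fischer Levenshtein distance using two rolling rows. This is only an assumption about what the asker’s extension does, not their actual code, but any O(n·m) implementation like this will dominate the runtime, so it is worth keeping tight:

    ```swift
    import Foundation

    // Hedged sketch of a Levenshtein edit distance (Wagner–Fischer), using two
    // rolling rows rather than a full matrix to keep memory at O(m).
    extension String {
        func minimumEditDistance(other: String) -> Int {
            let a = Array(self), b = Array(other)
            if a.isEmpty { return b.count }
            if b.isEmpty { return a.count }

            var previous = Array(0 ... b.count)             // row for i - 1
            var current = [Int](repeating: 0, count: b.count + 1)

            for i in 1 ... a.count {
                current[0] = i
                for j in 1 ... b.count {
                    let cost = a[i - 1] == b[j - 1] ? 0 : 1
                    current[j] = Swift.min(previous[j] + 1,        // deletion
                                           current[j - 1] + 1,     // insertion
                                           previous[j - 1] + cost) // substitution
                }
                swap(&previous, &current)
            }
            return previous[b.count]
        }
    }
    ```

    Because the hobby strings are normalized (sorted, fixed length 10), you could also compare them even more cheaply with a set-intersection count, but the distance function is interchangeable in the parallel structure above.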