Tags: ios, swift, multithreading, concurrency, coreml

CoreML can't work in concurrency (Multithreading)?


Despite my best efforts to make a CoreML MLModel process its predictions in parallel, it seems like, under the hood, Apple forces it to run serially, one prediction at a time.

I made a public repository with a PoC reproducing the issue: https://github.com/SocialKitLtd/coreml-concurrency-issue.

What I have tried:

  • Re-creating the MLModel every time instead of using a global instance
  • Using only the .cpuAndGPU configuration

What I'm trying to achieve:
I'm trying to use multithreading to process a batch of video frames at the same time (assuming the CPU/RAM can handle it), faster than the one-by-one strategy.

Code (Also presented in the repository):

class ViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()
        let parallelTaskCount = 3
        
        for i in 0..<parallelTaskCount {
            DispatchQueue.global(qos: .userInteractive).async {
                let image = UIImage(named: "image.jpg")!
                self.runPrediction(index: i, image: image)
            }
        }
    }

    
    func runPrediction(index: Int, image: UIImage) {
        let conf = MLModelConfiguration()
        conf.computeUnits = .cpuAndGPU
        conf.allowLowPrecisionAccumulationOnGPU = true
        
        let myModel = try! MyModel(configuration: conf)
        let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)
        // Prediction
        let prediction = try! myModel.prediction(input: myModelInput)
        print("finished processing \(index)")
    }
    
}

Any help will be highly appreciated.


Solution

  • When you employ parallel execution on the CPU, you can achieve significant performance gains for CPU-bound calculations only. But Core ML is not CPU-bound. When you leverage the GPU (e.g., with .cpuAndGPU), you will not see the same sort of CPU-parallelism-driven performance gains that you see with CPU-only calculations.

    That having been said, the GPU (or the Neural Engine) is so much faster than the parallelized CPU rendition that one would generally forgo parallelized CPU calculations altogether and favor the GPU: even a non-parallel GPU run will often beat a parallelized CPU run.

    That said, when I employed parallelism in GPU-compute tasks, there was still some modest performance gain. In my experiments, I saw minor benefits (13% and 18% faster when going from serial to three concurrent operations on an iPhone and an M1 iPad, respectively) for GPU-based Core ML calculations, and slightly more material benefits on a Mac. Just do not expect a dramatic performance improvement.


    Profiling with Instruments (by pressing command-i in Xcode or choosing “Product” » “Profile”) can be illuminating. See Recording Performance Data.
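
    The screenshots below each include a “Points of Interest” track: the code at the end of this answer wraps every prediction call in os_signpost begin/end markers, so each prediction shows up as a labeled interval. The basic pattern, independent of Core ML, is just this (a minimal sketch; doWork is a placeholder name, not something from the question's repository):

    import os.log

    // A “Points of Interest” log; begin/end signposts on it appear as labeled
    // intervals in the corresponding Instruments track.
    private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)

    func doWork(index: Int) {
        let id = OSSignpostID(log: poi)
        os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
        defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }
        // … the work being measured goes here …
    }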

    First, let us consider the computeUnits = .cpuOnly scenario. Here it is running 20 Core ML prediction calls sequentially (with a maxConcurrentOperationCount of 1):

    [Instruments screenshot: 20 sequential CPU-only predictions in the Points of Interest track]

    And, if I switch to the CPU view, I can see that it is jumping between two performance cores on my iPhone 12 Pro Max:

    [Instruments screenshot: CPU view showing the work alternating between two performance cores]

    That makes sense. OK, now let us change maxConcurrentOperationCount to 3: the overall processing time (the processAll function) drops from 5 minutes to 3½ minutes:

    [Instruments screenshot: CPU-only predictions with maxConcurrentOperationCount = 3]

    And when I switch to the CPU view to see what is going on, it looks like it started running on both performance cores in parallel, but then switched to some of the efficiency cores (probably because the thermal state of the device was getting stressed, which explains why we did not achieve anything close to a 2× speedup):

    [Instruments screenshot: CPU view showing the work shifting from performance to efficiency cores]
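
    If you want to check that thermal hypothesis on your own device, one option (a sketch of mine, not something from the original project) is to observe ProcessInfo's thermal-state notifications while the predictions run:

    import Foundation

    // Sketch: log thermal-state transitions during the run. If the state climbs
    // to .serious or .critical, throttling likely explains the move from
    // performance to efficiency cores.
    let thermalObserver = NotificationCenter.default.addObserver(
        forName: ProcessInfo.thermalStateDidChangeNotification,
        object: nil,
        queue: .main
    ) { _ in
        switch ProcessInfo.processInfo.thermalState {
        case .nominal:    print("thermal state: nominal")
        case .fair:       print("thermal state: fair")
        case .serious:    print("thermal state: serious")
        case .critical:   print("thermal state: critical")
        @unknown default: print("thermal state: unknown")
        }
    }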

    So, when doing CPU-only CoreML calculations, parallel execution can yield significant benefits. That having been said, the CPU-only calculations are much slower than the GPU calculations.


    When I switched to .cpuAndGPU, the difference between a maxConcurrentOperationCount of 1 and 3 was far less pronounced: about 45 seconds when allowing three concurrent operations versus 50 seconds when executing serially. Here it is running three in parallel:

    [Instruments screenshot: .cpuAndGPU predictions with three concurrent operations]

    And sequentially:

    [Instruments screenshot: .cpuAndGPU predictions running serially]

    But in contrast to the .cpuOnly scenarios, you can see in the CPU track that the CPUs are largely idle. Here is the latter run with the CPU view to show the details:

    [Instruments screenshot: serial .cpuAndGPU run, CPU view showing the CPUs largely idle]

    So, one can see that running the predictions on multiple CPU cores does not achieve much of a performance gain, as this work is not CPU-bound but is constrained by the GPU.


    Here is my code for the above. Note, I used OperationQueue as it provides a simple mechanism to control the degree of concurrency (maxConcurrentOperationCount):

    import os.log
    
    private let poi = OSLog(subsystem: "Test", category: .pointsOfInterest)
    

    and

    func processAll() {
        let parallelTaskCount = 20
    
        let queue = OperationQueue()
        queue.maxConcurrentOperationCount = 3          // or try `1`
    
        let id = OSSignpostID(log: poi)
        os_signpost(.begin, log: poi, name: #function, signpostID: id)
    
        for i in 0 ..< parallelTaskCount {
            queue.addOperation {
                let image = UIImage(named: "image.jpg")!
            self.runPrediction(index: i, image: image)
            }
        }
    
        queue.addBarrierBlock {
            os_signpost(.end, log: poi, name: #function, signpostID: id)
        }
    }
    
    func runPrediction(index: Int, image: UIImage) {
        let id = OSSignpostID(log: poi)
        os_signpost(.begin, log: poi, name: #function, signpostID: id, "%d", index)
        defer { os_signpost(.end, log: poi, name: #function, signpostID: id, "%d", index) }
    
        let conf = MLModelConfiguration()
        conf.computeUnits = .cpuAndGPU                 // contrast to `.cpuOnly`
        conf.allowLowPrecisionAccumulationOnGPU = true
        
        let myModel = try! MyModel(configuration: conf)
        let myModelInput = try! MyModelInput(LR_inputWith: image.cgImage!)
        // Prediction
        let prediction = try! myModel.prediction(input: myModelInput)
        os_signpost(.event, log: poi, name: "finished processing", "%d %@", index, prediction.featureNames)
    }
    

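    For completeness, a minimal sketch of how one might drive the above from a view controller (ProfilingViewController is a placeholder name of mine; processAll() and runPrediction(index:image:) are the methods above, and MyModel/MyModelInput are the classes generated from the model in the question's repository, so this will not compile without them):

    import CoreML
    import UIKit
    import os.log

    class ProfilingViewController: UIViewController {
        override func viewDidLoad() {
            super.viewDidLoad()
            processAll()   // then profile via “Product” » “Profile” (command-I) in Instruments
        }

        // processAll() and runPrediction(index:image:) from above go here
    }
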
    Note, above I have focused on CPU usage. You can also use the “Core ML” template in Instruments. E.g., here are the Points of Interest and the Core ML tracks next to each other on my M1 iPad Pro (with maxConcurrentOperationCount set to 2 to keep it simple):

    [Instruments screenshot: Points of Interest and Core ML tracks side by side on an M1 iPad Pro]

    At first glance, it looks like Core ML is processing these in parallel, but if I run it again with a maxConcurrentOperationCount of 1 (i.e., serially), the time for those individual compute tasks is shorter, which suggests that in the parallel scenario there is some GPU-related contention.

    Anyway, in short, you can use Instruments to observe what is going on. One can achieve significant performance improvements through parallel processing for CPU-bound tasks only; anything requiring the GPU or the Neural Engine will be further constrained by that hardware.